The purpose of model selection algorithms such as All Subsets,
Forward Selection and Backward Elimination is to choose a linear
model on the basis of the same set of data to which the model will
be applied. Typically we have available a large collection of possible
covariates from which we hope to select a parsimonious set for
the efficient prediction of a response variable. Least Angle Regression
(LARS), a new model selection algorithm, is a useful and less
greedy version of traditional forward selection methods. Three main
properties are derived: (1) A simple modification of the LARS algorithm
implements the Lasso, an attractive version of ordinary least
squares that constrains the sum of the absolute regression coefficients;
the LARS modification calculates all possible Lasso estimates for
a given problem, using an order of magnitude less computer time
than previous methods. (2) A different LARS modification efficiently
implements Forward Stagewise linear regression, another promising
new model selection method; this connection explains the similar
numerical results previously observed for the Lasso and Stagewise, and
helps us understand the properties of both methods, which are seen
as constrained versions of the simpler LARS algorithm. (3) A simple
approximation for the degrees of freedom of a LARS estimate is
available, from which we derive a $C_p$ estimate of prediction error;
this allows a principled choice among the range of possible LARS
estimates. LARS and its variants are computationally efficient: the
paper describes a publicly available algorithm that requires only the
same order of magnitude of computational effort as ordinary least
squares applied to the full set of covariates.
1. Introduction. Automatic model-building algorithms are familiar, and
sometimes notorious, in the linear model literature: Forward Selection,
Backward Elimination, All Subsets regression and various combinations are used
to automatically produce "good" linear models for predicting a response $y$
on the basis of some measured covariates $x_1, x_2, \ldots, x_m$. Goodness is often
defined in terms of prediction accuracy, but parsimony is another important
criterion: simpler models are preferred for the sake of scientific insight into
the $x$-$y$ relationship. Two promising recent model-building algorithms, the
Lasso and Forward Stagewise linear regression, will be discussed here, and
motivated in terms of a computationally simpler method called Least Angle
Regression.

Least Angle Regression (LARS) relates to the classic model-selection
method known as Forward Selection, or "forward stepwise regression,"
described in Weisberg [(1980), Section 8.5]: given a collection of possible
predictors, we select the one having largest absolute correlation with the response
$y$, say $x_{j_1}$, and perform simple linear regression of $y$ on $x_{j_1}$. This leaves a
residual vector orthogonal to $x_{j_1}$, now considered to be the response. We
project the other predictors orthogonally to $x_{j_1}$ and repeat the selection
process. After $k$ steps this results in a set of predictors $x_{j_1}, x_{j_2}, \ldots, x_{j_k}$ that
are then used in the usual way to construct a $k$-parameter linear model.
Forward Selection is an aggressive fitting technique that can be overly greedy,
perhaps eliminating at the second step useful predictors that happen to be
correlated with $x_{j_1}$.

Forward Stagewise, as described below, is a much more cautious version 
of Forward Selection, which may take thousands of tiny steps as it moves 
toward a final model. It turns out, and this was the original motivation for 
the LARS algorithm, that a simple formula allows Forward Stagewise to be 
implemented using fairly large steps, though not as large as a classic Forward 
Selection, greatly reducing the computational burden. The geometry of the 
algorithm, described in Section 2, suggests the name "Least Angle Regres- 
sion." It then happens that this same geometry applies to another, seemingly 
quite different, selection method called the Lasso [Tibshirani (1996)]. The 
LARS-Lasso-Stagewise connection is conceptually as well as computation- 
ally useful. The Lasso is described next, in terms of the main example used 
in this paper. 

Table 1 shows a small part of the data for our main example. 

Ten baseline variables, age, sex, body mass index, average blood pressure 
and six blood serum measurements, were obtained for each of n = 442 dia- 
betes patients, as well as the response of interest, a quantitative measure of 
disease progression one year after baseline. The statisticians were asked to 
construct a model that predicted response $y$ from covariates $x_1, x_2, \ldots, x_{10}$.
Two hopes were evident here, that the model would produce accurate baseline
predictions of response for future patients and that the form of the model
would suggest which covariates were important factors in disease progression.

The Lasso is a constrained version of ordinary least squares (OLS). Let
$x_1, x_2, \ldots, x_m$ be $n$-vectors representing the covariates, $m = 10$ and $n = 442$ in the
diabetes study, and let $y$ be the vector of responses for the $n$ cases. By
location and scale transformations we can always assume that the covariates
have been standardized to have mean 0 and unit length, and that the
response has mean 0,

$$\sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} x_{ij}^2 = 1 \qquad \text{for } j = 1, 2, \ldots, m. \tag{1.1}$$

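This is assumed to be the case in the theory which follows, except that
numerical results are expressed in the original units of the diabetes example.
A candidate vector of regression coefficients $\hat\beta = (\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_m)'$ gives
prediction vector $\hat\mu$,

$$\hat\mu = \sum_{j=1}^{m} x_j \hat\beta_j = X\hat\beta \qquad [X_{n \times m} = (x_1, x_2, \ldots, x_m)] \tag{1.2}$$

with total squared error

$$S(\hat\beta) = \|y - \hat\mu\|^2 = \sum_{i=1}^{n} (y_i - \hat\mu_i)^2. \tag{1.3}$$

Let $T(\hat\beta)$ be the absolute norm of $\hat\beta$,

$$T(\hat\beta) = \sum_{j=1}^{m} |\hat\beta_j|. \tag{1.4}$$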
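The standardization (1.1) and the two quantities $S$ and $T$ of (1.3)-(1.4) can be sketched in a few lines of NumPy. This is only an illustration: the function names `standardize`, `squared_error` and `abs_norm` are hypothetical, and the data are synthetic random numbers, not the diabetes study.

```python
import numpy as np

def standardize(X, y):
    """Center y; center each column of X and scale it to unit length, as in (1.1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)                  # column means become 0
    Xs = Xc / np.linalg.norm(Xc, axis=0)     # column sums of squares become 1
    return Xs, y - y.mean()

def squared_error(X, y, beta):
    """Total squared error S(beta) = ||y - X beta||^2, as in (1.3)."""
    r = y - X @ beta
    return float(r @ r)

def abs_norm(beta):
    """Absolute norm T(beta) = sum_j |beta_j|, as in (1.4)."""
    return float(np.sum(np.abs(beta)))

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
X, y = standardize(rng.normal(size=(50, 4)), rng.normal(size=50))
```

With the covariates on a common scale, a bound on $T(\hat\beta)$ treats all coefficients alike, which is why the standardization is assumed before the Lasso constraint is introduced.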

Table 1. Diabetes study: 442 diabetes patients were measured on 10 baseline variables; a
prediction model was desired for the response variable, a measure of disease progression
one year after baseline.

Fig. 1. Estimates of regression coefficients $\hat\beta_j$, $j = 1, 2, \ldots, 10$, for the diabetes study.
(Left panel) Lasso estimates, as a function of $t = \sum_j |\hat\beta_j|$. The covariates enter the
regression equation sequentially as $t$ increases, in order $j = 3, 9, 4, 7, \ldots, 1$. (Right panel) The
same plot for Forward Stagewise Linear Regression. The two plots are nearly identical, but
differ slightly for large $t$ as shown in the track of covariate 8.



The Lasso chooses $\hat\beta$ by minimizing $S(\hat\beta)$ subject to a bound $t$ on $T(\hat\beta)$,

$$\text{Lasso: minimize } S(\hat\beta) \text{ subject to } T(\hat\beta) \le t. \tag{1.5}$$



Quadratic programming techniques can be used to solve (1.5), though we will
present an easier method here, closely related to the "homotopy method" of
Osborne, Presnell and Turlach (2000a).

The left panel of Figure 1 shows all Lasso solutions $\hat\beta(t)$ for the diabetes
study, as $t$ increases from 0, where $\hat\beta = 0$, to $t = 3460.00$, where $\hat\beta$ equals the
OLS regression vector, the constraint in (1.5) no longer binding. We see that
the Lasso tends to shrink the OLS coefficients toward 0, more so for small values
of $t$. Shrinkage often improves prediction accuracy, trading off decreased
variance for increased bias as discussed in Hastie, Tibshirani and Friedman
(2001).

The Lasso also has a parsimony property: for any given constraint value
$t$, only a subset of the covariates have nonzero values of $\hat\beta_j$. At $t = 1000$, for
example, only variables 3, 9, 4 and 7 enter the Lasso regression model (1.2).
If this model provides adequate predictions, a crucial question considered in
Section 4, the statisticians could report these four variables as the important
ones.

Forward Stagewise Linear Regression, henceforth called Stagewise, is an
iterative technique that begins with $\hat\mu = 0$ and builds up the regression
function in successive small steps. If $\hat\mu$ is the current Stagewise estimate, let
$c(\hat\mu)$ be the vector of current correlations

$$\hat c = c(\hat\mu) = X'(y - \hat\mu), \tag{1.6}$$

so that $\hat c_j$ is proportional to the correlation between covariate $x_j$ and the
current residual vector. The next step of the Stagewise algorithm is taken
in the direction of the greatest current correlation,

$$\hat j = \arg\max_j |\hat c_j| \quad \text{and} \quad \hat\mu \to \hat\mu + \varepsilon \cdot \operatorname{sign}(\hat c_{\hat j}) \cdot x_{\hat j}, \tag{1.7}$$

with $\varepsilon$ some small constant. "Small" is important here: the "big" choice
$\varepsilon = |\hat c_{\hat j}|$ leads to the classic Forward Selection technique, which can be overly
greedy, impulsively eliminating covariates which are correlated with $x_{\hat j}$. The
Stagewise procedure is related to boosting and also to Friedman's MART
algorithm [Friedman (2001)]; see Section 8, as well as Hastie, Tibshirani and Friedman
[(2001), Chapter 10 and Algorithm 10.4].
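The Stagewise iteration (1.6)-(1.7) is short enough to write out directly. The sketch below is a minimal illustration under stated assumptions: the function name `forward_stagewise`, the default step size and the stopping rule (stop once a further step of size `eps` would no longer reduce the squared error) are choices made here, not part of the procedure as described above, and the data are synthetic.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    """Forward Stagewise per (1.6)-(1.7); X is assumed standardized as in (1.1)."""
    n, m = X.shape
    beta = np.zeros(m)
    mu = np.zeros(n)                       # begin at mu = 0
    for _ in range(n_steps):
        c = X.T @ (y - mu)                 # current correlations (1.6)
        j = int(np.argmax(np.abs(c)))      # index of greatest |c_j|
        # Assumed stopping rule: a step changes S by -2*eps*|c_j| + eps^2,
        # so stop when that change is no longer negative.
        if abs(c[j]) <= eps / 2:
            break
        delta = eps * np.sign(c[j])        # small signed step (1.7)
        beta[j] += delta
        mu += delta * X[:, j]
    return beta

# Synthetic example (not the diabetes data).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)
y = X @ np.array([5.0, -3.0, 0.0]) + 0.1 * rng.normal(size=40)
y -= y.mean()

beta_sw = forward_stagewise(X, y, eps=0.01, n_steps=20000)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

Run long enough, the Stagewise coefficients settle close to the OLS vector; recording `beta` at every step instead of only at the end would trace out coefficient tracks of the kind plotted in the right panel of Figure 1.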

The right panel of Figure 1 shows the coefficient plot for Stagewise applied
to the diabetes data. The estimates were built up in 6000 Stagewise steps
[making $\varepsilon$ in (1.7) small enough to conceal the "Etch-a-Sketch" staircase
seen in Figure 2, Section 2]. The striking fact is the similarity between the
Lasso and Stagewise estimates. Although their definitions look completely
different, the results are nearly, but not exactly, identical.

The main point of this paper is that both Lasso and Stagewise are variants 
of a basic procedure called Least Angle Regression, abbreviated LARS (the 
"S" suggesting "Lasso" and "Stagewise"). Section 2 describes the LARS 
algorithm while Section 3 discusses modifications that turn LARS into Lasso 
or Stagewise, reducing the computational burden by at least an order of 
magnitude for either one. Sections 5 and 6 verify the connections stated in 
Section 3. 

Least Angle Regression is interesting in its own right, its simple structure
lending itself to inferential analysis. Section 4 analyzes the "degrees of
freedom" of a LARS regression estimate. This leads to a $C_p$ type statistic that
suggests which estimate we should prefer among a collection of possibilities
like those in Figure 1. A particularly simple $C_p$ approximation, requiring
no additional computation beyond that for the $\hat\beta$ vectors, is available for
LARS.

Section 7 briefly discusses computational questions. An efficient S pro- 
gram for all three methods, LARS, Lasso and Stagewise, is available. Sec- 
tion 8 elaborates on the connections with boosting. 

2. The LARS algorithm. Least Angle Regression is a stylized version 
of the Stagewise procedure that uses a simple mathematical formula to ac- 
celerate the computations. Only m steps are required for the full set of 



6 EFRON, HASTIE, JOHNSTONE AND TIBSHIRANI 

solutions, where m is the number of covariates: m = 10 in the diabetes ex- 
ample compared to the 6000 steps used in the right panel of Figure 1. This 
section describes the LARS algorithm. Modifications of LARS that produce 
Lasso and Stagewise solutions are discussed in Section 3, and verified in Sec- 
tions 5 and 6. Section 4 uses the simple structure of LARS to help analyze 
its estimation properties. 

The LARS procedure works roughly as follows. As with classic Forward
Selection, we start with all coefficients equal to zero, and find the predictor
most correlated with the response, say $x_{j_1}$. We take the largest step possible
in the direction of this predictor until some other predictor, say $x_{j_2}$, has
as much correlation with the current residual. At this point LARS parts
company with Forward Selection. Instead of continuing along $x_{j_1}$, LARS
proceeds in a direction equiangular between the two predictors until a third
variable $x_{j_3}$ earns its way into the "most correlated" set. LARS then
proceeds equiangularly between $x_{j_1}, x_{j_2}$ and $x_{j_3}$, that is, along the "least angle
direction," until a fourth variable enters, and so on.

The remainder of this section describes the algebra necessary to execute 
the equiangular strategy. As usual the algebraic details look more compli- 
cated than the simple underlying geometry, but they lead to the highly 
efficient computational algorithm described in Section 7. 

LARS builds up estimates $\hat\mu = X\hat\beta$, (1.2), in successive steps, each step
adding one covariate to the model, so that after $k$ steps just $k$ of the $\hat\beta_j$'s
are nonzero. Figure 2 illustrates the algorithm in the situation with $m = 2$
covariates, $X = (x_1, x_2)$. In this case the current correlations (1.6) depend
only on the projection $\bar y_2$ of $y$ into the linear space $\mathcal{L}(X)$ spanned by $x_1$
and $x_2$,

$$c(\hat\mu) = X'(y - \hat\mu) = X'(\bar y_2 - \hat\mu). \tag{2.1}$$

Fig. 2. The LARS algorithm in the case of $m = 2$ covariates; $\bar y_2$ is the projection of $y$
into $\mathcal{L}(x_1, x_2)$. Beginning at $\hat\mu_0 = 0$, the residual vector $\bar y_2 - \hat\mu_0$ has greater correlation
with $x_1$ than $x_2$; the next LARS estimate is $\hat\mu_1 = \hat\mu_0 + \hat\gamma_1 x_1$, where $\hat\gamma_1$ is chosen such that
$\bar y_2 - \hat\mu_1$ bisects the angle between $x_1$ and $x_2$; then $\hat\mu_2 = \hat\mu_1 + \hat\gamma_2 u_2$, where $u_2$ is the unit
bisector; $\hat\mu_2 = \bar y_2$ in the case $m = 2$, but not for the case $m > 2$; see Figure 4. The staircase
indicates a typical Stagewise path. Here LARS gives the Stagewise track as $\varepsilon \to 0$, but a
modification is necessary to guarantee agreement in higher dimensions; see Section 3.2.

The algorithm begins at $\hat\mu_0 = 0$ [remembering that the response has had
its mean subtracted off, as in (1.1)]. Figure 2 has $\bar y_2 - \hat\mu_0$ making a smaller
angle with $x_1$ than $x_2$, that is, $c_1(\hat\mu_0) > c_2(\hat\mu_0)$. LARS then augments $\hat\mu_0$ in
the direction of $x_1$, to

$$\hat\mu_1 = \hat\mu_0 + \hat\gamma_1 x_1. \tag{2.2}$$

Stagewise would choose $\hat\gamma_1$ equal to some small value $\varepsilon$, and then repeat the
process many times. Classic Forward Selection would take $\hat\gamma_1$ large enough to
make $\hat\mu_1$ equal $\bar y_1$, the projection of $y$ into $\mathcal{L}(x_1)$. LARS uses an intermediate
value of $\hat\gamma_1$, the value that makes $\bar y_2 - \hat\mu_1$ equally correlated with $x_1$ and $x_2$;
that is, $\bar y_2 - \hat\mu_1$ bisects the angle between $x_1$ and $x_2$, so $c_1(\hat\mu_1) = c_2(\hat\mu_1)$.

Let $u_2$ be the unit vector lying along the bisector. The next LARS estimate
is

$$\hat\mu_2 = \hat\mu_1 + \hat\gamma_2 u_2, \tag{2.3}$$

with $\hat\gamma_2$ chosen to make $\hat\mu_2 = \bar y_2$ in the case $m = 2$. With $m > 2$ covariates,
$\hat\gamma_2$ would be smaller, leading to another change of direction, as illustrated
in Figure 4. The "staircase" in Figure 2 indicates a typical Stagewise path.
LARS is motivated by the fact that it is easy to calculate the step sizes
$\hat\gamma_1, \hat\gamma_2, \ldots$ theoretically, short-circuiting the small Stagewise steps.

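Subsequent LARS steps, beyond two covariates, are taken along equiangular
vectors, generalizing the bisector $u_2$ in Figure 2. We assume that the
covariate vectors $x_1, x_2, \ldots, x_m$ are linearly independent. For $\mathcal{A}$ a subset of
the indices $\{1, 2, \ldots, m\}$, define the matrix

$$X_\mathcal{A} = (\cdots\, s_j x_j\, \cdots)_{j \in \mathcal{A}}, \tag{2.4}$$

where the signs $s_j$ equal $\pm 1$. Let

$$\mathcal{G}_\mathcal{A} = X_\mathcal{A}' X_\mathcal{A} \quad \text{and} \quad A_\mathcal{A} = (1_\mathcal{A}' \mathcal{G}_\mathcal{A}^{-1} 1_\mathcal{A})^{-1/2}, \tag{2.5}$$

$1_\mathcal{A}$ being a vector of 1's of length equaling $|\mathcal{A}|$, the size of $\mathcal{A}$. The

$$\text{equiangular vector} \quad u_\mathcal{A} = X_\mathcal{A} w_\mathcal{A}, \quad \text{where } w_\mathcal{A} = A_\mathcal{A} \mathcal{G}_\mathcal{A}^{-1} 1_\mathcal{A}, \tag{2.6}$$

is the unit vector making equal angles, less than 90°, with the columns of
$X_\mathcal{A}$,

$$X_\mathcal{A}' u_\mathcal{A} = A_\mathcal{A} 1_\mathcal{A} \quad \text{and} \quad \|u_\mathcal{A}\|^2 = 1. \tag{2.7}$$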
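Definitions (2.4)-(2.6) translate almost line for line into NumPy, and property (2.7) can then be checked numerically. The following is a sketch under assumptions: the function name `equiangular`, the choice of active indices and signs, and the random standardized design are all made up for illustration.

```python
import numpy as np

def equiangular(X, active, signs):
    """Return (u_A, w_A, A_A) of (2.4)-(2.6) for active indices and signs s_j = +/-1."""
    XA = X[:, active] * signs              # (2.4): columns s_j * x_j
    G = XA.T @ XA                          # (2.5): Gram matrix G_A
    ones = np.ones(len(active))
    Ginv1 = np.linalg.solve(G, ones)       # G_A^{-1} 1_A
    AA = 1.0 / np.sqrt(ones @ Ginv1)       # (2.5): scalar A_A
    w = AA * Ginv1                         # (2.6): weights w_A
    return XA @ w, w, AA                   # (2.6): u_A = X_A w_A

# Synthetic standardized covariates (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)

active = [0, 2, 3]
signs = np.array([1.0, -1.0, 1.0])
u, w, AA = equiangular(X, active, signs)
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse of the Gram matrix is a numerical convenience; the algebra is the same as in (2.5)-(2.6).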
We can now fully describe the LARS algorithm. As with the Stagewise
procedure we begin at $\hat\mu_0 = 0$ and build up $\hat\mu$ by steps, larger steps in the
LARS case. Suppose that $\hat\mu_\mathcal{A}$ is the current LARS estimate and that

$$\hat c = X'(y - \hat\mu_\mathcal{A}) \tag{2.8}$$

is the vector of current correlations (1.6). The active set $\mathcal{A}$ is the set of indices
corresponding to covariates with the greatest absolute current correlations,

$$\hat C = \max_j \{|\hat c_j|\} \quad \text{and} \quad \mathcal{A} = \{j : |\hat c_j| = \hat C\}. \tag{2.9}$$

Letting

$$s_j = \operatorname{sign}\{\hat c_j\} \quad \text{for } j \in \mathcal{A}, \tag{2.10}$$

we compute $X_\mathcal{A}$, $A_\mathcal{A}$ and $u_\mathcal{A}$ as in (2.4)-(2.6), and also the inner product
vector

$$a \equiv X' u_\mathcal{A}. \tag{2.11}$$

Then the next step of the LARS algorithm updates $\hat\mu_\mathcal{A}$, say to

$$\hat\mu_{\mathcal{A}_+} = \hat\mu_\mathcal{A} + \hat\gamma u_\mathcal{A}, \tag{2.12}$$

where

$$\hat\gamma = \min_{j \in \mathcal{A}^c}{}^{+} \left\{ \frac{\hat C - \hat c_j}{A_\mathcal{A} - a_j},\; \frac{\hat C + \hat c_j}{A_\mathcal{A} + a_j} \right\}; \tag{2.13}$$

"$\min^+$" indicates that the minimum is taken over only positive components
within each choice of $j$ in (2.13).
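Steps (2.8)-(2.13) can be assembled into a compact routine. The sketch below is not the authors' program: the name `lars` is hypothetical, the active set is tracked explicitly as a list instead of being recomputed from the tie condition in (2.9) (an assumption made to avoid floating-point equality tests), and the final step uses the convention $\hat\gamma = \hat C / A_\mathcal{A}$ that this section states for the case where all covariates are active. The data are synthetic.

```python
import numpy as np

def lars(X, y):
    """Unmodified LARS following (2.8)-(2.13); X standardized, y centered.

    Returns the (m+1) x m array of coefficient vectors and the entry order.
    """
    n, m = X.shape
    beta = np.zeros(m)
    mu = np.zeros(n)
    path = [beta.copy()]
    active = [int(np.argmax(np.abs(X.T @ y)))]    # most correlated covariate
    for _ in range(m):
        c = X.T @ (y - mu)                        # (2.8)
        C = np.max(np.abs(c))                     # (2.9)
        s = np.sign(c[active])                    # (2.10)
        XA = X[:, active] * s                     # (2.4)
        ones = np.ones(len(active))
        Ginv1 = np.linalg.solve(XA.T @ XA, ones)
        AA = 1.0 / np.sqrt(ones @ Ginv1)          # (2.5)
        w = AA * Ginv1
        u = XA @ w                                # (2.6)
        a = X.T @ u                               # (2.11)
        j_new = None
        if len(active) == m:
            gamma = C / AA                        # last step: go all the way to OLS
        else:
            gamma = np.inf
            for j in range(m):
                if j in active:
                    continue
                # the two candidates of (2.13); keep the smallest positive one
                for g in ((C - c[j]) / (AA - a[j]), (C + c[j]) / (AA + a[j])):
                    if 1e-12 < g < gamma:
                        gamma, j_new = g, j
        mu = mu + gamma * u                       # (2.12)
        beta[active] += gamma * s * w             # same move in coefficient space
        path.append(beta.copy())
        if j_new is not None:
            active.append(j_new)
    return np.array(path), active

# Synthetic example (not the diabetes data).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 1.5]) + 0.1 * rng.normal(size=60)
y -= y.mean()

path, order = lars(X, y)
```

Each row of `path` is the coefficient vector after one more LARS step, so plotting the columns against the row-wise absolute norm would give tracks analogous to the left panel of Figure 3.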

Formulas (2.12) and (2.13) have the following interpretation: define

$$\mu(\gamma) = \hat\mu_\mathcal{A} + \gamma u_\mathcal{A}, \tag{2.14}$$

for $\gamma > 0$, so that the current correlation

$$c_j(\gamma) = x_j'(y - \mu(\gamma)) = \hat c_j - \gamma a_j. \tag{2.15}$$

For $j \in \mathcal{A}$, (2.7)-(2.9) yield

$$|c_j(\gamma)| = \hat C - \gamma A_\mathcal{A}, \tag{2.16}$$

showing that all of the maximal absolute current correlations decline equally.
For $j \in \mathcal{A}^c$, equating (2.15) with (2.16) shows that $c_j(\gamma)$ equals the maximal
value at $\gamma = (\hat C - \hat c_j)/(A_\mathcal{A} - a_j)$. Likewise $-c_j(\gamma)$, the current correlation
for the reversed covariate $-x_j$, achieves maximality at $(\hat C + \hat c_j)/(A_\mathcal{A} + a_j)$.
Therefore $\hat\gamma$ in (2.13) is the smallest positive value of $\gamma$ such that some new
index $\hat j$ joins the active set; $\hat j$ is the minimizing index in (2.13), and the
new active set $\mathcal{A}_+$ is $\mathcal{A} \cup \{\hat j\}$; the new maximum absolute correlation is
$\hat C_+ = \hat C - \hat\gamma A_\mathcal{A}$.
Figure 3 concerns the LARS analysis of the diabetes data. The complete
algorithm required only $m = 10$ steps of procedure (2.8)-(2.13), with
the variables joining the active set $\mathcal{A}$ in the same order as for the Lasso:
$3, 9, 4, 7, \ldots, 1$. Tracks of the regression coefficients $\hat\beta_j$ are nearly but not
exactly the same as either the Lasso or Stagewise tracks of Figure 1.

The right panel shows the absolute current correlations

$$|\hat c_{kj}| = |x_j'(y - \hat\mu_{k-1})| \tag{2.17}$$

for variables $j = 1, 2, \ldots, 10$, as a function of the LARS step $k$. The maximum
correlation

$$\hat C_k = \max_j \{|\hat c_{kj}|\} = \hat C_{k-1} - \hat\gamma_{k-1} A_{k-1} \tag{2.18}$$

declines with $k$, as it must. At each step a new variable $j$ joins the active set,
henceforth having $|\hat c_{kj}| = \hat C_k$. The sign $s_j$ of each $x_j$ in (2.4) stays constant
as the active set increases.

Fig. 3. LARS analysis of the diabetes study: (left) estimates of regression coefficients $\hat\beta_j$,
$j = 1, 2, \ldots, 10$; plotted versus $\sum |\hat\beta_j|$; plot is slightly different than either Lasso or
Stagewise, Figure 1; (right) absolute current correlations as function of LARS step; variables
enter active set (2.9) in order $3, 9, 4, 7, \ldots, 1$; heavy curve shows maximum current
correlation $\hat C_k$ declining with $k$.

Section 4 makes use of the relationship between Least Angle Regression
and Ordinary Least Squares illustrated in Figure 4. Suppose LARS has just
completed step $k - 1$, giving $\hat\mu_{k-1}$, and is embarking upon step $k$. The active
set $\mathcal{A}_k$, (2.9), will have $k$ members, giving $X_k$, $\mathcal{G}_k$, $A_k$ and $u_k$ as in (2.4)-(2.6)
(here replacing subscript $\mathcal{A}$ with "$k$"). Let $\bar y_k$ indicate the projection of $y$
into $\mathcal{L}(X_k)$, which, since $\hat\mu_{k-1} \in \mathcal{L}(X_{k-1})$, is

$$\bar y_k = \hat\mu_{k-1} + X_k \mathcal{G}_k^{-1} X_k'(y - \hat\mu_{k-1}) = \hat\mu_{k-1} + \frac{\hat C_k}{A_k} u_k, \tag{2.19}$$

the last equality following from (2.6) and the fact that the signed current
correlations in $\mathcal{A}_k$ all equal $\hat C_k$,

$$X_k'(y - \hat\mu_{k-1}) = \hat C_k 1_\mathcal{A}. \tag{2.20}$$

Since $u_k$ is a unit vector, (2.19) says that $\bar y_k - \hat\mu_{k-1}$ has length

$$\bar\gamma_k \equiv \frac{\hat C_k}{A_k}. \tag{2.21}$$

Comparison with (2.12) shows that the LARS estimate $\hat\mu_k$ lies on the line
from $\hat\mu_{k-1}$ to $\bar y_k$,

$$\hat\mu_k - \hat\mu_{k-1} = \frac{\hat\gamma_k}{\bar\gamma_k} (\bar y_k - \hat\mu_{k-1}). \tag{2.22}$$

It is easy to see that $\hat\gamma_k$, (2.12), is always less than $\bar\gamma_k$, so that $\hat\mu_k$ lies closer
than $\bar y_k$ to $\hat\mu_{k-1}$. Figure 4 shows the successive LARS estimates $\hat\mu_k$ always
approaching but never reaching the OLS estimates $\bar y_k$.

The exception is at the last stage: since $\mathcal{A}_m$ contains all covariates, (2.13)
is not defined. By convention the algorithm takes $\hat\gamma_m = \bar\gamma_m = \hat C_m / A_m$, making
$\hat\mu_m = \bar y_m$ and $\hat\beta_m$ equal the OLS estimate for the full set of $m$ covariates.
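Both the last equality in (2.19) and the last-stage convention follow in one or two lines from (2.5)-(2.7) and (2.20). The derivation below is a sketch that spells out those steps; it introduces no new quantities.

```latex
% Second equality of (2.19): substitute (2.20), then use w_k = A_k G_k^{-1} 1 from (2.6).
\begin{align*}
X_k \mathcal{G}_k^{-1} X_k'(y - \hat\mu_{k-1})
  &= \hat C_k \, X_k \mathcal{G}_k^{-1} 1_{\mathcal{A}}
  = \frac{\hat C_k}{A_k} \, X_k w_k
  = \frac{\hat C_k}{A_k} \, u_k
  = \bar\gamma_k u_k .
\end{align*}

% Last stage: with \hat\gamma_m = \hat C_m / A_m, (2.20) and (2.7) give
\begin{align*}
X_m'\bigl(y - \hat\mu_{m-1} - \hat\gamma_m u_m\bigr)
  &= \hat C_m 1_{\mathcal{A}} - \frac{\hat C_m}{A_m} \, A_m 1_{\mathcal{A}} = 0 ,
\end{align*}
% so the residual is orthogonal to every covariate: the normal equations hold
% and \hat\mu_m is the OLS fit \bar y_m.
```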

The LARS algorithm is computationally thrifty. Organizing the calcula- 
tions correctly, the computational cost for the entire m steps is of the same 
order as that required for the usual Least Squares solution for the full set 
of m covariates. Section 7 describes an efficient LARS program available 
from the authors. With the modifications described in the next section, this 
program also provides economical Lasso and Stagewise solutions. 




Fig. 4. At each stage the LARS estimate $\hat\mu_k$ approaches, but does not reach, the
corresponding OLS estimate $\bar y_k$.
3. Modified versions of Least Angle Regression. Figures 1 and 3 show
Lasso, Stagewise and LARS yielding remarkably similar estimates for the
diabetes data. The similarity is no coincidence. This section describes simple
modifications of the LARS algorithm that produce Lasso or Stagewise
estimates. Besides improved computational efficiency, these relationships
elucidate the methods' rationale: all three algorithms can be viewed as
moderately greedy forward stepwise procedures whose forward progress is
determined by compromise among the currently most correlated covariates. LARS
moves along the most obvious compromise direction, the equiangular vector
(2.6), while Lasso and Stagewise put some restrictions on the equiangular
strategy.

3.1. The LARS-Lasso relationship. The full set of Lasso solutions, as
shown for the diabetes study in Figure 1, can be generated by a minor
modification of the LARS algorithm (2.8)-(2.13). Our main result is described
here and verified in Section 5. It closely parallels the homotopy method in
the papers by Osborne, Presnell and Turlach (2000a, b), though the LARS
approach is somewhat more direct.

Let $\hat\beta$ be a Lasso solution (1.5), with $\hat\mu = X\hat\beta$. Then it is easy to show
that the sign of any nonzero coordinate $\hat\beta_j$ must agree with the sign $s_j$ of
the current correlation $\hat c_j = x_j'(y - \hat\mu)$,

$$\operatorname{sign}(\hat\beta_j) = \operatorname{sign}(\hat c_j) = s_j; \tag{3.1}$$

see Lemma 8 of Section 5. The LARS algorithm does not enforce restriction
(3.1), but it can easily be modified to do so.

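Suppose we have just completed a LARS step, giving a new active set $\mathcal{A}$
as in (2.9), and that the corresponding LARS estimate $\hat\mu_\mathcal{A}$ corresponds to
a Lasso solution $\hat\mu = X\hat\beta$. Let

$$w_\mathcal{A} = A_\mathcal{A} \mathcal{G}_\mathcal{A}^{-1} 1_\mathcal{A}, \tag{3.2}$$

a vector of length the size of $\mathcal{A}$, and (somewhat abusing subscript notation)
define $\hat d$ to be the $m$-vector equaling $s_j w_{\mathcal{A}j}$ for $j \in \mathcal{A}$ and zero elsewhere.
Moving in the positive $\gamma$ direction along the LARS line (2.14), we see that

$$\mu(\gamma) = X\beta(\gamma), \quad \text{where } \beta_j(\gamma) = \hat\beta_j + \gamma \hat d_j \tag{3.3}$$

for $j \in \mathcal{A}$. Therefore $\beta_j(\gamma)$ will change sign at

$$\gamma_j = -\hat\beta_j / \hat d_j, \tag{3.4}$$

the first such change occurring at

$$\tilde\gamma = \min_{\gamma_j > 0} \{\gamma_j\}, \tag{3.5}$$

say for covariate $x_{\tilde j}$; $\tilde\gamma$ equals infinity by definition if there is no $\gamma_j > 0$.

If $\tilde\gamma$ is less than $\hat\gamma$, (2.13), then $\beta_j(\gamma)$ cannot be a Lasso solution for $\gamma > \tilde\gamma$
since the sign restriction (3.1) must be violated: $\beta_{\tilde j}(\gamma)$ has changed sign while
$c_{\tilde j}(\gamma)$ has not. [The continuous function $c_{\tilde j}(\gamma)$ cannot change sign within a
single LARS step since $|c_{\tilde j}(\gamma)| = \hat C - \gamma A_\mathcal{A} > 0$, (2.16).]

Lasso modification. If $\tilde\gamma < \hat\gamma$, stop the ongoing LARS step at $\gamma = \tilde\gamma$
and remove $\tilde j$ from the calculation of the next equiangular direction. That
is,

$$\hat\mu_{\mathcal{A}_+} = \hat\mu_\mathcal{A} + \tilde\gamma u_\mathcal{A} \quad \text{and} \quad \mathcal{A}_+ = \mathcal{A} - \{\tilde j\} \tag{3.6}$$

rather than (2.12).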

Theorem 1. Under the Lasso modification, and assuming the "one at
a time" condition discussed below, the LARS algorithm yields all Lasso
solutions.

The active sets $\mathcal{A}$ grow monotonically larger as the original LARS algorithm
progresses, but the Lasso modification allows $\mathcal{A}$ to decrease. "One at
a time" means that the increases and decreases never involve more than a
single index $j$. This is the usual case for quantitative data and can always
be realized by adding a little jitter to the $y$ values. Section 5 discusses tied
situations.
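The decision made by the Lasso modification, (3.4)-(3.6), is a small computation on the active coefficients. The sketch below isolates just that decision; the helper names `lasso_sign_change` and `lasso_step_length` are hypothetical, the inputs are the active coordinates of $\hat\beta$ and $\hat d$, and the numbers in the example are made up.

```python
import numpy as np

def lasso_sign_change(beta_active, d_active):
    """Return (gamma_tilde, position) from (3.4)-(3.5); (inf, None) if no gamma_j > 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        gammas = -beta_active / d_active             # (3.4): where each beta_j hits zero
    gammas = np.where(gammas > 0, gammas, np.inf)    # only positive gamma_j count
    pos = int(np.argmin(gammas))
    if not np.isfinite(gammas[pos]):
        return np.inf, None                          # gamma_tilde is infinite by definition
    return float(gammas[pos]), pos                   # (3.5)

def lasso_step_length(gamma_hat, beta_active, d_active):
    """Lasso modification: stop at gamma_tilde if it comes before gamma_hat of (2.13)."""
    gamma_tilde, pos = lasso_sign_change(beta_active, d_active)
    if gamma_tilde < gamma_hat:
        return gamma_tilde, pos      # (3.6): the covariate at `pos` leaves the active set
    return gamma_hat, None           # ordinary LARS step (2.12)

# Made-up active coefficients and directions d_j = s_j * w_Aj.
beta_active = np.array([2.0, -1.0, 0.5])
d_active = np.array([-4.0, -1.0, 1.0])
early = lasso_step_length(0.8, beta_active, d_active)
full = lasso_step_length(0.3, beta_active, d_active)
```

In the example the first coefficient would cross zero at $\gamma = 0.5$, so a LARS step of length 0.8 is cut short and that covariate is dropped, while a step of length 0.3 proceeds unchanged.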

The Lasso diagram in Figure 1 was actually calculated using the modified
LARS algorithm. Modification (3.6) came into play only once, at the arrowed
point in the left panel. There $\mathcal{A}$ contained all 10 indices while $\mathcal{A}_+ = \mathcal{A} - \{7\}$.
Variable 7 was restored to the active set one LARS step later, the next and
last step then taking $\hat\beta$ all the way to the full OLS solution. The brief
absence of variable 7 had an effect on the tracks of the others, noticeably
$\hat\beta_8$. The price of using Lasso instead of unmodified LARS comes in the form
of added steps, 12 instead of 10 in this example. For the more complicated
"quadratic model" of Section 4, the comparison was 103 Lasso steps versus
64 for LARS.

3.2. The LARS-Stagewise relationship. The staircase in Figure 2
indicates how the Stagewise algorithm might proceed forward from $\hat\mu_1$, a point
of equal current correlations $\hat c_1 = \hat c_2$, (2.8). The first small step has
(randomly) selected index $j = 1$, taking us to $\hat\mu_1 + \varepsilon x_1$. Now variable 2 is more
correlated,

$$x_2'(y - \hat\mu_1 - \varepsilon x_1) > x_1'(y - \hat\mu_1 - \varepsilon x_1), \tag{3.7}$$

forcing $j = 2$ to be the next Stagewise choice and so on.

We will consider an idealized Stagewise procedure in which the step size
$\varepsilon$ goes to zero. This collapses the staircase along the direction of the bisector
$u_2$ in Figure 2, making the Stagewise and LARS estimates agree. They
always agree for $m = 2$ covariates, but another modification is necessary for
LARS to produce Stagewise estimates in general. Section 6 verifies the main
result described next.

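Suppose that the Stagewise procedure has taken $N$ steps of infinitesimal
size $\varepsilon$ from some previous estimate $\hat\mu$, with

$$N_j \equiv \#\{\text{steps with selected index } j\}, \qquad j = 1, 2, \ldots, m. \tag{3.8}$$

It is easy to show, as in Lemma 11 of Section 6, that $N_j = 0$ for $j$ not in the
active set $\mathcal{A}$ defined by the current correlations $x_j'(y - \hat\mu)$, (2.9). Letting

$$P \equiv (N_1, N_2, \ldots, N_m)/N, \tag{3.9}$$

with $P_\mathcal{A}$ indicating the coordinates of $P$ for $j \in \mathcal{A}$, the new estimate is

$$\mu = \hat\mu + N\varepsilon X_\mathcal{A} P_\mathcal{A} \qquad [(2.4)]. \tag{3.10}$$

(Notice that the Stagewise steps are taken along the directions $s_j x_j$.)
The LARS algorithm (2.14) progresses along

$$\hat\mu_\mathcal{A} + \gamma X_\mathcal{A} w_\mathcal{A}, \quad \text{where } w_\mathcal{A} = A_\mathcal{A} \mathcal{G}_\mathcal{A}^{-1} 1_\mathcal{A} \qquad [(2.6)\text{-}(3.2)]. \tag{3.11}$$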

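Comparing (3.10) with (3.11) shows that LARS cannot agree with Stagewise
if $w_\mathcal{A}$ has negative components, since $P_\mathcal{A}$ is nonnegative. To put it another
way, the direction of Stagewise progress $X_\mathcal{A} P_\mathcal{A}$ must lie in the convex cone
generated by the columns of $X_\mathcal{A}$,

$$\mathcal{C}_\mathcal{A} = \left\{ v = \sum_{j \in \mathcal{A}} s_j x_j P_j, \; P_j \ge 0 \right\}. \tag{3.12}$$

If $u_\mathcal{A} \in \mathcal{C}_\mathcal{A}$ then there is no contradiction between (3.12) and (3.13). If
not it seems natural to replace $u_\mathcal{A}$ with its projection into $\mathcal{C}_\mathcal{A}$, that is, the
nearest point in the convex cone.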

Stagewise modification. Proceed as in (2.8)-(2.13), except with $u_\mathcal{A}$
replaced by $u_{\hat{\mathcal{B}}}$, the unit vector lying along the projection of $u_\mathcal{A}$ into $\mathcal{C}_\mathcal{A}$.
(See Figure 9 in Section 6.)

Theorem 2. Under the Stagewise modification, the LARS algorithm
yields all Stagewise solutions.
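The projection in the Stagewise modification can be illustrated for a small active set by brute force: the nearest point of the cone is the least squares projection onto the span of some subset of the signed columns, with nonnegative coefficients, so it suffices to try every subset and keep the closest feasible candidate. This enumeration is an assumption made here for clarity, not the method used in the paper, and it is only practical for a handful of active covariates. The names `project_onto_cone` and `stagewise_direction` and the three-column example are hypothetical.

```python
import itertools
import numpy as np

def project_onto_cone(XA, u):
    """Nearest point to u in the cone {XA @ p : p >= 0}, by enumerating faces."""
    k = XA.shape[1]
    best, best_dist, best_face = np.zeros_like(u), float(u @ u), ()
    for size in range(1, k + 1):
        for face in itertools.combinations(range(k), size):
            cols = list(face)
            p, *_ = np.linalg.lstsq(XA[:, cols], u, rcond=None)
            if np.all(p >= -1e-12):                  # candidate lies inside the cone
                v = XA[:, cols] @ p
                dist = float((u - v) @ (u - v))
                if dist < best_dist - 1e-12:
                    best, best_dist, best_face = v, dist, face
    return best, best_face

def stagewise_direction(XA, u):
    """Unit vector along the projection of u into the cone, plus the face used."""
    v, face = project_onto_cone(XA, u)
    return v / np.linalg.norm(v), face

# Made-up signed, unit-length columns s_j x_j whose equiangular vector leaves the cone.
XA = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0],
               [0.0, 0.0, 0.1]])
XA /= np.linalg.norm(XA, axis=0)
g = np.linalg.solve(XA.T @ XA, np.ones(3))
w = g / np.sqrt(g.sum())                             # w_A of (2.6); one entry is negative
u = XA @ w                                           # equiangular vector u_A
u_B, face = stagewise_direction(XA, u)
```

In this example `w` has a negative third component, so the equiangular vector lies outside the cone; the projection falls on the face spanned by the first two columns, and the modified direction is their bisector.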

The vector $u_{\hat{\mathcal{B}}}$ in the Stagewise modification is the equiangular vector
(2.6) for the subset $\hat{\mathcal{B}} \subseteq \mathcal{A}$ corresponding to the face of $\mathcal{C}_\mathcal{A}$ into which the
projection falls. Stagewise is a LARS type algorithm that allows the active
set to decrease by one or more indices. This happened at the arrowed point
in the right panel of Figure 1: there the set $\mathcal{A} = \{3, 9, 4, 7, 2, 10, 5, 8\}$ was
decreased to $\hat{\mathcal{B}} = \mathcal{A} - \{3, 7\}$. It took a total of 13 modified LARS steps to
reach the full OLS solution $\bar\beta_m = (X'X)^{-1}X'y$. The three methods, LARS,
Lasso and Stagewise, always reach OLS eventually, but LARS does so in
only $m$ steps while Lasso and, especially, Stagewise can take longer. For the
$m = 64$ quadratic model of Section 4, Stagewise took 255 steps.

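According to Theorem 2 the difference between successive
Stagewise-modified LARS estimates is

$$\hat\mu_{\mathcal{A}_+} - \hat\mu_\mathcal{A} = \hat\gamma u_{\hat{\mathcal{B}}} = \hat\gamma X_{\hat{\mathcal{B}}} w_{\hat{\mathcal{B}}}, \tag{3.13}$$

as in (3.13). Since $u_{\hat{\mathcal{B}}}$ exists in the convex cone $\mathcal{C}_\mathcal{A}$, $w_{\hat{\mathcal{B}}}$ must have nonnegative
components. This says that the difference of successive coefficient estimates
for coordinate $j \in \hat{\mathcal{B}}$ satisfies

$$\operatorname{sign}(\hat\beta_{+j} - \hat\beta_j) = s_j, \tag{3.14}$$

where $s_j = \operatorname{sign}\{x_j'(y - \hat\mu)\}$.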

We can now make a useful comparison of the three methods:

1. Stagewise: successive differences of $\hat\beta_j$ agree in sign with the current
correlation $\hat c_j = x_j'(y - \hat\mu)$;

2. Lasso: $\hat\beta_j$ agrees in sign with $\hat c_j$;

3. LARS: no sign restrictions (but see Lemma 4 of Section 5).

From this point of view, Lasso is intermediate between the LARS and
Stagewise methods.

The successive difference property (3.14) makes the Stagewise $\hat\beta_j$
estimates move monotonically away from 0. Reversals are possible only if $\hat c_j$
changes sign while $\hat\beta_j$ is "resting" between two periods of change. This
happened to variable 7 in Figure 1 between the 8th and 10th Stagewise-modified
LARS steps.

3.3. Simulation study. A small simulation study was carried out com- 
paring the LARS, Lasso and Stagewise algorithms. The X matrix for the 
simulation was based on the diabetes example of Table 1, but now using 
a "Quadratic Model" having m = 64 predictors, including interactions and 
squares of the 10 original covariates: 

$$\text{Quadratic Model:} \quad \text{10 main effects, 45 interactions, 9 squares,} \tag{3.15}$$

the last being the squares of each $x_j$ except the dichotomous variable $x_2$. The
true mean vector $\mu$ for the simulation was $\mu = X\beta$, where $\beta$ was obtained by
running LARS for 10 steps on the original $(X, y)$ diabetes data (agreeing in
this case with the 10-step Lasso or Stagewise analysis). Subtracting $\mu$ from
a centered version of the original $y$ vector of Table 1 gave a vector $\varepsilon = y - \mu$
of $n = 442$ residuals. The "true $R^2$" for this model, $\|\mu\|^2 / (\|\mu\|^2 + \|\varepsilon\|^2)$,
equaled 0.416.

100 simulated response vectors $y^*$ were generated from the model

$$y^* = \mu + \varepsilon^*, \tag{3.16}$$

with $\varepsilon^* = (\varepsilon_1^*, \varepsilon_2^*, \ldots, \varepsilon_n^*)$ a random sample, with replacement, from the
components of $\varepsilon$. The LARS algorithm with $K = 40$ steps was run for each
simulated data set $(X, y^*)$, yielding a sequence of estimates $\hat\mu^{(k)*}$, $k = 1, 2, \ldots, 40$,
and likewise using the Lasso and Stagewise algorithms.

Figure 5 compares the LARS, Lasso and Stagewise estimates. For a given
estimate $\hat\mu$ define the proportion explained $\mathrm{pe}(\hat\mu)$ to be

$$\mathrm{pe}(\hat\mu) = 1 - \|\hat\mu - \mu\|^2 / \|\mu\|^2, \tag{3.17}$$

so $\mathrm{pe}(0) = 0$ and $\mathrm{pe}(\mu) = 1$. The solid curve graphs the average of $\mathrm{pe}(\hat\mu^{(k)*})$
over the 100 simulations, versus step number $k$ for LARS, $k = 1, 2, \ldots, 40$.
The corresponding curves are graphed for Lasso and Stagewise, except that
the horizontal axis is now the average number of nonzero $\hat\beta_j^*$ terms composing
$\hat\mu^{(k)*}$. For example, $\hat\mu^{(40)*}$ averaged 33.23 nonzero terms with Stagewise,
compared to 35.83 for Lasso and 40 for LARS.
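The resampling scheme (3.16), the "true $R^2$" and the proportion explained (3.17) are simple to express in code. The sketch below uses hypothetical function names and a tiny made-up mean vector and residual vector; it does not reproduce the diabetes simulation or its reported values.

```python
import numpy as np

def true_r2(mu, resid):
    """'True R^2' of the simulation model: ||mu||^2 / (||mu||^2 + ||resid||^2)."""
    return float(mu @ mu / (mu @ mu + resid @ resid))

def simulate_responses(mu, resid, n_sims, rng):
    """Rows are y* = mu + e*, with e* drawn with replacement from resid, as in (3.16)."""
    n = len(mu)
    idx = rng.integers(0, n, size=(n_sims, n))   # resampled residual positions
    return mu + resid[idx]

def proportion_explained(mu_hat, mu):
    """pe(mu_hat) = 1 - ||mu_hat - mu||^2 / ||mu||^2, as in (3.17)."""
    return float(1.0 - np.sum((mu_hat - mu) ** 2) / np.sum(mu ** 2))

# Tiny made-up example.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5, 0.5])
resid = np.array([0.3, -0.1, -0.4, 0.2])
y_star = simulate_responses(mu, resid, n_sims=100, rng=rng)
```

Averaging `proportion_explained` over the simulated data sets at each step $k$, for the estimates produced by any fitting routine, gives curves of the kind described for Figure 5.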

Figure 5's most striking message is that the three algorithms performed
almost identically, and rather well. The average proportion explained rises
quickly, reaching a maximum of 0.963 at $k = 10$, and then declines slowly
as $k$ grows to 40. The light dots display the small standard deviation of
$\mathrm{pe}(\hat\mu^{(k)*})$ over the 100 simulations, roughly $\pm 0.02$. Stopping at any point
between $k = 5$ and 25 typically gave a $\hat\mu^{(k)*}$ with true predictive $R^2$ about
0.40, compared to the ideal value 0.416 for $\mu$.

Fig. 5. Simulation study comparing LARS, Lasso and Stagewise algorithms; 100
replications of model (3.15)-(3.16). Solid curve shows average proportion explained, (3.17),
for LARS estimates as function of number of steps $k = 1, 2, \ldots, 40$; Lasso and Stagewise
give nearly identical results; small dots indicate plus or minus one standard deviation over
the 100 simulations. Classic Forward Selection (heavy dashed curve) rises and falls more
abruptly.

The dashed curve in Figure 5 tracks the average proportion explained
by classic Forward Selection. It rises very quickly, to a maximum of 0.950
after $k = 3$ steps, and then falls back more abruptly than the
LARS-Lasso-Stagewise curves. This behavior agrees with the characterization of Forward
Selection as a dangerously greedy algorithm.

3.4. Other LARS modifications. Here are a few more examples of LARS 
type model-building algorithms. 

Positive Lasso. Constraint (1.5) can be strengthened to

(3.18)  minimize $S(\hat\beta)$ subject to $T(\hat\beta) \le t$ and all $\hat\beta_j \ge 0$.

This would be appropriate if the statisticians or scientists believed that
the variables $x_j$ must enter the prediction equation in their defined
directions. Situation (3.18) is a more difficult quadratic programming problem
than (1.5), but it can be solved by a further modification of the
Lasso-modified LARS algorithm: change $|\hat c_j|$ to $\hat c_j$ at both places in (2.9), set
$s_j = 1$ instead of (2.10) and change (2.13) to

(3.19)  $\hat\gamma = \min^{+}_{j \in \mathcal{A}^c} \left\{ \frac{\hat C - \hat c_j}{A_{\mathcal{A}} - a_j} \right\}.$

The positive Lasso usually does not converge to the full OLS solution $\bar\beta_m$,
even for very large choices of $t$.

The changes above amount to considering the $x_j$ as generating half-lines
rather than full one-dimensional spaces. A positive Stagewise version can
be developed in the same way, and has the property that the $\hat\beta_j$ tracks are
always monotone.

LARS-OLS hybrid. After $k$ steps the LARS algorithm has identified a
set $\mathcal{A}_k$ of covariates, for example, $\mathcal{A}_4 = \{3, 9, 4, 7\}$ in the diabetes study.
Instead of $\hat\beta_k$ we might prefer $\bar\beta_k$, the OLS coefficients based on the linear
model with covariates in $\mathcal{A}_k$, that is, using LARS to find the model but not to
estimate the coefficients. Besides looking more familiar, this will always
increase the usual empirical $R^2$ measure of fit (though not necessarily the
true fitting accuracy),

(3.20)  $R^2(\bar\beta_k) - R^2(\hat\beta_k) = \frac{(1 - \rho_k)^2}{\rho_k (2 - \rho_k)} \left[ R^2(\hat\beta_k) - R^2(\hat\beta_{k-1}) \right],$

where $\rho_k = \hat\gamma_k / \bar\gamma_k$ as in (2.22).

The increases in $R^2$ were small in the diabetes example, on the order of
0.01 for $k \ge 4$ compared with $R^2 \approx 0.50$, which is expected from (3.20) since
we would usually continue LARS until $R^2(\hat\beta_k) - R^2(\hat\beta_{k-1})$ was small. For
the same reason $\hat\beta_k$ and $\bar\beta_k$ are likely to lie near each other as they did in
the diabetes example.
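
Identity (3.20) can be checked numerically on a toy configuration. The sketch below assumes, as we read (2.22), that the LARS estimate moves a fraction $\rho_k$ of the way from $\hat\mu_{k-1}$ to the OLS fit $\bar y_k$ on the active covariates; all vectors and the value of `rho` are made up for illustration, and the denominator of $R^2$ is taken to be $\|y\|^2$ (any positive constant gives the same conclusion).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy configuration (hypothetical values): the active covariates span the
# first two coordinates, so y - y_bar_k is orthogonal to that span.
y = [3.0, 4.0, 2.0]        # response
y_bar_k = [3.0, 4.0, 0.0]  # OLS fit on the active set
mu_prev = [1.0, 1.0, 0.0]  # LARS estimate after step k-1
rho = 0.6                  # assumed ratio gamma_hat_k / gamma_bar_k

# Assumed LARS update: move a fraction rho toward the OLS fit.
mu_k = [m + rho * (b - m) for m, b in zip(mu_prev, y_bar_k)]

tss = sum(v ** 2 for v in y)  # R^2 denominator for this sketch


def r2(fit):
    return 1.0 - sq_dist(y, fit) / tss


lhs = r2(y_bar_k) - r2(mu_k)
rhs = (1 - rho) ** 2 / (rho * (2 - rho)) * (r2(mu_k) - r2(mu_prev))
print(lhs, rhs)
```

The two printed values agree up to rounding, reflecting that the residual sum of squares drops by $\rho_k(2 - \rho_k)$ times the squared distance $\|\bar y_k - \hat\mu_{k-1}\|^2$ at the LARS step and by a further $(1 - \rho_k)^2$ times that distance when switching to OLS.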

Main effects first. It is straightforward to restrict the order in which
variables are allowed to enter the LARS algorithm. For example, having
obtained $\mathcal{A}_4 = \{3, 9, 4, 7\}$ for the diabetes study, we might then wish to check
for interactions. To do this we begin LARS again, replacing $y$ with
$y - \hat\mu_4$ and $X$ with the $n \times 6$ matrix whose columns represent the interactions
$x_{3:9}, x_{3:4}, \ldots, x_{4:7}$.

Backward Lasso. The Lasso-modified LARS algorithm can be run backward,
starting from the full OLS solution $\bar\beta_m$. Assuming that all the
coordinates of $\bar\beta_m$ are nonzero, their signs must agree with the signs $s_j$ that
the current correlations had during the final LARS step. This allows us to
calculate the last equiangular direction $u_{\mathcal{A}}$, (2.4)-(2.6). Moving backward
from $\hat\mu_m = X\bar\beta_m$ along the line $\mu(\gamma) = \hat\mu_m - \gamma u_{\mathcal{A}}$, we eliminate from the
active set the index of the first $\hat\beta_j$ that becomes zero. Continuing backward,
we keep track of all coefficients $\hat\beta_j$ and current correlations $\hat c_j$, following
essentially the same rules for changing $\mathcal{A}$ as in Section 3.1. As in (2.3), (3.5)
the calculation of $\tilde\gamma$ and $\hat\gamma$ is easy.

The crucial property of the Lasso that makes backward navigation possible
is (3.1), which permits calculation of the correct equiangular direction
$u_{\mathcal{A}}$ at each step. In this sense Lasso can be just as well thought of as a
backward-moving algorithm. This is not the case for LARS or Stagewise,
both of which are inherently forward-moving algorithms.

4. Degrees of freedom and $C_p$ estimates. Figures 1 and 3 show all possible
Lasso, Stagewise or LARS estimates of the vector $\beta$ for the diabetes
data. The scientists want just a single $\hat\beta$ of course, so we need some rule for
selecting among the possibilities. This section concerns a $C_p$-type selection
criterion, especially as it applies to the choice of LARS estimate.

Let $\hat\mu = g(y)$ represent a formula for estimating $\mu$ from the data vector
$y$. Here, as usual in regression situations, we are considering the covariate
vectors $x_1, x_2, \ldots, x_m$ fixed at their observed values. We assume that given
the $x$'s, $y$ is generated according to a homoskedastic model

(4.1)  $y \sim (\mu, \sigma^2 I),$

meaning that the components $y_i$ are uncorrelated, with mean $\mu_i$ and variance
$\sigma^2$. Taking expectations in the identity

(4.2)  $(\hat\mu_i - \mu_i)^2 = (y_i - \hat\mu_i)^2 - (y_i - \mu_i)^2 + 2(\hat\mu_i - \mu_i)(y_i - \mu_i),$

and summing over $i$, yields

(4.3)  $E\left\{ \frac{\|\hat\mu - \mu\|^2}{\sigma^2} \right\} = E\left\{ \frac{\|y - \hat\mu\|^2}{\sigma^2} - n \right\} + 2 \sum_{i=1}^{n} \frac{\mathrm{cov}(\hat\mu_i, y_i)}{\sigma^2}.$

The last term of (4.3) leads to a convenient definition of the degrees of
freedom for an estimator $\hat\mu = g(y)$,

(4.4)  $df_{\mu,\sigma^2} = \sum_{i=1}^{n} \mathrm{cov}(\hat\mu_i, y_i) / \sigma^2,$

and a $C_p$-type risk estimation formula,

(4.5)  $C_p(\hat\mu) = \frac{\|y - \hat\mu\|^2}{\sigma^2} - n + 2\, df_{\mu,\sigma^2}.$

If $\sigma^2$ and $df_{\mu,\sigma^2}$ are known, $C_p(\hat\mu)$ is an unbiased estimator of the true
risk $E\{\|\hat\mu - \mu\|^2 / \sigma^2\}$. For linear estimators $\hat\mu = My$, model (4.1) makes
$df_{\mu,\sigma^2} = \mathrm{trace}(M)$, equaling the usual definition of degrees of freedom for
OLS, and coinciding with the proposal of Mallows (1973). Section 6 of
Efron and Tibshirani (1997) and Section 7 of Efron (1986) discuss formulas
(4.4) and (4.5) and their role in $C_p$, Akaike information criterion (AIC) and
Stein's unbiased risk estimate (SURE) estimation theory, a more recent
reference being Ye (1998).

Practical use of $C_p$ formula (4.5) requires preliminary estimates of $\mu$, $\sigma^2$
and $df_{\mu,\sigma^2}$. In the numerical results below, the usual OLS estimates $\bar\mu$ and $\bar\sigma^2$
from the full OLS model were used to calculate bootstrap estimates of $df_{\mu,\sigma^2}$;
bootstrap samples $y^*$ and replications $\hat\mu^*$ were then generated according to

(4.6)  $y^* \sim N(\bar\mu, \bar\sigma^2)$ and $\hat\mu^* = g(y^*)$.

Independently repeating (4.6) say $B$ times gives straightforward estimates
for the covariances in (4.4),

(4.7)  $\widehat{\mathrm{cov}}_i = \frac{\sum_{b=1}^{B} \hat\mu_i^*(b) [y_i^*(b) - y_i^*(\cdot)]}{B - 1},$ where $y^*(\cdot) = \frac{\sum_{b=1}^{B} y^*(b)}{B},$

and then

(4.8)  $\widehat{df} = \sum_{i=1}^{n} \widehat{\mathrm{cov}}_i / \bar\sigma^2.$

Normality is not crucial in (4.6). Nearly the same results were obtained using
$y^* = \bar\mu + e^*$, where the components of $e^*$ were resampled from $e = y - \bar\mu$.
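
The bootstrap recipe (4.6)-(4.8) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the reported results: the function name `bootstrap_df`, the estimator `shrink_half` and the vector `mu_bar` are hypothetical, and the estimator is a simple linear map so that the answer can be compared with $\mathrm{trace}(M)$.

```python
import random


def bootstrap_df(g, mu_bar, sigma_bar, B=2000, seed=0):
    """Bootstrap estimate of degrees of freedom following (4.6)-(4.8).

    g         : estimator, maps a data vector (list) to a fitted vector (list)
    mu_bar    : preliminary estimate of the mean vector
    sigma_bar : preliminary estimate of the noise standard deviation
    """
    rng = random.Random(seed)
    n = len(mu_bar)
    ys, fits = [], []
    for _ in range(B):
        # (4.6): draw y* ~ N(mu_bar, sigma_bar^2) and recompute the estimator.
        y_star = [m + sigma_bar * rng.gauss(0.0, 1.0) for m in mu_bar]
        ys.append(y_star)
        fits.append(g(y_star))
    df = 0.0
    for i in range(n):
        y_dot = sum(y[i] for y in ys) / B
        # (4.7): covariance estimate for coordinate i.
        cov_i = sum(f[i] * (y[i] - y_dot) for f, y in zip(fits, ys)) / (B - 1)
        # (4.8): accumulate cov_i / sigma_bar^2.
        df += cov_i / sigma_bar ** 2
    return df


def shrink_half(y):
    """Hypothetical linear estimator mu_hat = 0.5 * y, so trace(M) = n / 2."""
    return [0.5 * v for v in y]


mu_bar = [1.0, -2.0, 0.5, 3.0]  # made-up preliminary mean estimate
df_hat = bootstrap_df(shrink_half, mu_bar, sigma_bar=1.0)
print(df_hat)  # should be close to trace(M) = 2 for n = 4
```

Any estimator `g`, including a LARS fit at a fixed step $k$, could be passed in place of `shrink_half`; the linear example is used here only because its degrees of freedom are known exactly.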

The left panel of Figure 6 shows $\widehat{df}_k$ for the diabetes data LARS
estimates $\hat\mu_k$, $k = 1, 2, \ldots, m = 10$. It portrays a startlingly simple situation that
we will call the "simple approximation,"

(4.9)  $df(\hat\mu_k) \doteq k.$

Fig. 6. Degrees of freedom for LARS estimates $\hat\mu_k$: (left) diabetes study,
Table 1, $k = 1, 2, \ldots, m = 10$; (right) quadratic model (3.15) for the diabetes
data, $m = 64$. Solid line is simple approximation $df_k = k$. Dashed lines are
approximate 95% confidence intervals for the bootstrap estimates. Each panel
based on $B = 500$ bootstrap replications.

The right panel also applies to the diabetes data, but this time with the 
quadratic model (3.15), having m = 64 predictors. We see that the simple 
approximation (4.9) is again accurate within the limits of the bootstrap 
computation (4.8), where $B = 500$ replications were divided into 10 groups
of 50 each in order to calculate Student-t confidence intervals.

If (4.9) can be believed, and we will offer some evidence in its behalf, we
can estimate the risk of a $k$-step LARS estimator $\hat\mu_k$ by

(4.10)  $C_p(\hat\mu_k) \doteq \|y - \hat\mu_k\|^2 / \bar\sigma^2 - n + 2k.$

The formula, which is the same as the $C_p$ estimate of risk for an OLS
estimator based on a subset of $k$ preselected predictor vectors, has the great
advantage of not requiring any further calculations beyond those for the
original LARS estimates. The formula applies only to LARS, and not to
Lasso or Stagewise.
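
Formula (4.10) turns model choice into a one-line computation once the residual sums of squares along the LARS path are available. The sketch below assumes those values are already in hand; the function name `cp_lars` and the numbers in `rss` are invented for illustration and are not the diabetes results.

```python
def cp_lars(rss, sigma2, n):
    """C_p values from (4.10) for k = 1, ..., len(rss).

    rss[k-1] is ||y - mu_hat_k||^2 for the k-step LARS estimate,
    sigma2 is the preliminary variance estimate, n the sample size.
    """
    return [r / sigma2 - n + 2 * k for k, r in enumerate(rss, start=1)]


# Made-up residual sums of squares along a five-step path.
rss = [120.0, 80.0, 62.0, 60.5, 60.0]
sigma2, n = 2.0, 30

cps = cp_lars(rss, sigma2, n)
best_k = min(range(1, len(cps) + 1), key=lambda k: cps[k - 1])
print(cps, best_k)
```

With these numbers the fit improves only marginally after the third step, so the penalty $2k$ dominates and the minimum falls at $k = 3$.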

Figure 7 displays $C_p(\hat\mu_k)$ as a function of $k$ for the two situations of
Figure 6. Minimum $C_p$ was achieved at steps $k = 7$ and $k = 16$, respectively.
Both of the minimum $C_p$ models looked sensible, their first several selections
of "important" covariates agreeing with an earlier model based on a detailed
inspection of the data assisted by medical expertise.

Fig. 7. $C_p$ estimates of risk (4.10) for the two situations of Figure 6: (left)
$m = 10$ model has smallest $C_p$ at $k = 7$; (right) $m = 64$ model has smallest $C_p$
at $k = 16$.

The simple approximation becomes a theorem in two cases.

Theorem 3. If the covariate vectors $x_1, x_2, \ldots, x_m$ are mutually
orthogonal, then the $k$-step LARS estimate $\hat\mu_k$ has $df(\hat\mu_k) = k$.

To state the second more general setting we introduce the following
condition.

Positive cone condition. For all possible subsets $X_{\mathcal{A}}$ of the full
design matrix $X$,

(4.11)  $G_{\mathcal{A}}^{-1} 1_{\mathcal{A}} > 0,$

where the inequality is taken element-wise.

The positive cone condition holds if X is orthogonal. It is strictly more 
general than orthogonality, but counterexamples (such as the diabetes data) 
show that not all design matrices X satisfy it. 

It is also easy to show that LARS, Lasso and Stagewise all coincide under 
the positive cone condition, so the degrees-of-freedom formula applies to 
them too in this case. 

Theorem 4. Under the positive cone condition, $df(\hat\mu_k) = k$.

The proof, which appears later in this section, is an application of Stein's
unbiased risk estimate (SURE) [Stein (1981)]. Suppose that $g : \mathbb{R}^n \to \mathbb{R}^n$ is
almost differentiable (see Remark A.1 in the Appendix) and set
$\nabla \cdot g = \sum_{i=1}^{n} \partial g_i / \partial x_i$. If $y \sim N_n(\mu, \sigma^2 I)$, then Stein's formula states that

(4.12)  $\sum_{i=1}^{n} \mathrm{cov}(g_i, y_i) / \sigma^2 = E[\nabla \cdot g(y)].$

The left-hand side is $df(g)$ for the general estimator $g(y)$. Focusing
specifically on LARS, it will turn out that $\nabla \cdot \hat\mu_k(y) = k$ in all situations with
probability 1, but that the continuity assumptions underlying (4.12) and
SURE can fail in certain nonorthogonal cases where the positive cone
condition does not hold.

A range of simulations suggested that the simple approximation is quite
accurate even when the $x_j$'s are highly correlated and that it requires
concerted effort at pathology to make $df(\hat\mu_k)$ much different than $k$.

Stein's formula assumes normality, $y \sim N(\mu, \sigma^2 I)$. A cruder "delta method"
rationale for the simple approximation requires only homoskedasticity, (4.1).
The geometry of Figure 4 implies

(4.13)  $\hat\mu_k = \bar y_k - \cot_k \cdot \|\bar y_{k+1} - \bar y_k\| \, u_k,$

where $\cot_k$ is the cotangent of the angle between $u_k$ and $u_{k+1}$,

(4.14)  $\cot_k = \frac{u_k' u_{k+1}}{[1 - (u_k' u_{k+1})^2]^{1/2}}.$

Let $v_k$ be the unit vector orthogonal to $\mathcal{L}(X_k)$, the linear space spanned by
the first $k$ covariates selected by LARS, and pointing into $\mathcal{L}(X_{k+1})$ along
the direction of $\bar y_{k+1} - \bar y_k$. For $y^*$ near $y$ we can reexpress (4.13) as a locally
linear transformation,

(4.15)  $\hat\mu_k^* = \hat\mu_k + M_k (y^* - y)$ with $M_k = P_k - \cot_k \cdot u_k v_k',$

$P_k$ being the usual projection matrix from $\mathbb{R}^n$ into $\mathcal{L}(X_k)$; (4.15) holds within
a neighborhood of $y$ such that the LARS choices $\mathcal{L}(X_k)$ and $v_k$ remain the
same.

The matrix $M_k$ has $\mathrm{trace}(M_k) = k$. Since the trace equals the degrees of
freedom for linear estimators, the simple approximation (4.9) is seen to be
a delta method approximation to the bootstrap estimates (4.6) and (4.7).
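
The trace claim follows in one line from the definition of $M_k$ in (4.15). The short derivation below spells it out; it uses only that $P_k$ projects onto the $k$-dimensional space $\mathcal{L}(X_k)$, that $u_k$ lies in $\mathcal{L}(X_k)$, and that $v_k$ is orthogonal to $\mathcal{L}(X_k)$.

```latex
\begin{align*}
\operatorname{trace}(M_k)
  &= \operatorname{trace}(P_k) - \cot_k \cdot \operatorname{trace}(u_k v_k') \\
  &= k - \cot_k \cdot (v_k' u_k) \\
  &= k,
  \qquad \text{since } u_k \in \mathcal{L}(X_k) \text{ and } v_k \perp \mathcal{L}(X_k)
  \text{ give } v_k' u_k = 0 .
\end{align*}
```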

It is clear that (4.9), $df(\hat\mu_k) \doteq k$, cannot hold for the Lasso, since the
degrees of freedom is $m$ for the full model but the total number of steps
taken can exceed $m$. However, we have found empirically that an intuitively
plausible result holds: the degrees of freedom is well approximated by the
number of nonzero predictors in the model. Specifically, starting at step 0,
let $\ell(k)$ be the index of the last model in the Lasso sequence containing
$k$ predictors. Then $df(\hat\mu_{\ell(k)}) \doteq k$. We do not yet have any mathematical
support for this claim.

4.1. Orthogonal designs. In the orthogonal case, we assume that $x_j = e_j$
for $j = 1, \ldots, m$. The LARS algorithm then has a particularly simple form,
reducing to soft thresholding at the order statistics of the data.

To be specific, define the soft thresholding operation on a scalar $y_1$ at
threshold $t$ by

$\eta(y_1; t) = \begin{cases} y_1 - t, & \text{if } y_1 > t, \\ 0, & \text{if } |y_1| \le t, \\ y_1 + t, & \text{if } y_1 < -t. \end{cases}$

The order statistics of the absolute values of the data are denoted by

(4.16)  $|y|_{(1)} \ge |y|_{(2)} \ge \cdots \ge |y|_{(n)} \ge |y|_{(n+1)} := 0.$

We note that $y_{m+1}, \ldots, y_n$ do not enter into the estimation procedure, and
so we may as well assume that $m = n$.

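Lemma 1. For an orthogonal design with $x_j = e_j$, $j = 1, \ldots, n$, the $k$th
LARS estimate ($0 \le k \le n$) is given by

(4.17)  $\hat\mu_{k,i}(y) = \begin{cases} y_i - |y|_{(k+1)}, & \text{if } y_i > |y|_{(k+1)}, \\ 0, & \text{if } |y_i| \le |y|_{(k+1)}, \\ y_i + |y|_{(k+1)}, & \text{if } y_i < -|y|_{(k+1)}, \end{cases}$

(4.18)  $\phantom{\hat\mu_{k,i}(y)} = \eta(y_i; |y|_{(k+1)}).$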

Proof. The proof is by induction, stepping through the LARS
sequence. First note that the LARS parameters take a simple form in the
orthogonal setting:

$G_{\mathcal{A}} = I_{\mathcal{A}}, \quad A_{\mathcal{A}} = |\mathcal{A}|^{-1/2}, \quad u_{\mathcal{A}} = |\mathcal{A}|^{-1/2} 1_{\mathcal{A}}, \quad a_{k,j} = 0, \; j \notin \mathcal{A}_k.$

We assume for the moment that there are no ties in the order statistics
(4.16), so that the variables enter one at a time. Let $j(l)$ be the index
corresponding to the $l$th order statistic, $|y|_{(l)} = s_l y_{j(l)}$: we will see that
$\mathcal{A}_k = \{j(1), \ldots, j(k)\}$.

We have $x_j' y = y_j$, and so at the first step LARS picks variable $j(1)$ and
sets $\hat C_1 = |y|_{(1)}$. It is easily seen that

$\hat\gamma_1 = \min_{j \ne j(1)} \{ |y|_{(1)} - |y_j| \} = |y|_{(1)} - |y|_{(2)}$

and so

$\hat\mu_1 = [\, |y|_{(1)} - |y|_{(2)} \,] e_{j(1)},$

which is precisely (4.17) for $k = 1$.

Suppose now that step $k - 1$ has been completed, so that $\mathcal{A}_k = \{j(1), \ldots, j(k)\}$
and (4.17) holds for $\hat\mu_{k-1}$. The current correlations are $\hat C_k = |y|_{(k)}$ and $\hat c_{k,j} = y_j$
for $j \notin \mathcal{A}_k$. Since $A_k - a_{k,j} = k^{-1/2}$, we have

$\hat\gamma_k = \min_{j \notin \mathcal{A}_k} k^{1/2} \{ |y|_{(k)} - |y_j| \}$

and

$\hat\gamma_k u_k = [\, |y|_{(k)} - |y|_{(k+1)} \,] 1\{ j \in \mathcal{A}_k \}.$

Adding this term to $\hat\mu_{k-1}$ yields (4.17) for step $k$.

The argument clearly extends to the case in which there are ties in the
order statistics (4.16): if $|y|_{(k+1)} = \cdots = |y|_{(k+r)}$, then $\mathcal{A}_k(y)$ expands by $r$
variables at step $k + 1$ and $\hat\mu_{k+\nu}(y)$, $\nu = 1, \ldots, r$, are all determined at the
same time and are equal to $\hat\mu_{k+1}(y)$. □

Proof of Theorem 4 (Orthogonal case). The argument is particularly
simple in this setting, and so worth giving separately. First we note
from (4.17) that $\hat\mu_k$ is continuous and Lipschitz(1) and so certainly almost
differentiable. Hence (4.12) shows that we simply have to calculate $\nabla \cdot \hat\mu_k$.
Inspection of (4.17) shows that

$\nabla \cdot \hat\mu_k = \sum_i \frac{\partial \hat\mu_{k,i}}{\partial y_i}(y) = \sum_i I\{ |y_i| > |y|_{(k+1)} \} = k$

almost surely, that is, except for ties. This completes the proof. □
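
Lemma 1 and the divergence calculation above are easy to try out numerically. The sketch below implements (4.17)-(4.18) for the design $x_j = e_j$ and approximates the divergence by central finite differences; the function names and the data vector `y` are our own, chosen with no ties among the $|y_i|$.

```python
def soft_threshold(y_i, t):
    """Soft thresholding of a scalar y_i at threshold t >= 0."""
    if y_i > t:
        return y_i - t
    if y_i < -t:
        return y_i + t
    return 0.0


def lars_orthogonal(y, k):
    """k-step LARS estimate for x_j = e_j via (4.17)-(4.18), 0 <= k <= n."""
    # Order statistics |y|_(1) >= ... >= |y|_(n), with |y|_(n+1) := 0 appended.
    abs_sorted = sorted((abs(v) for v in y), reverse=True) + [0.0]
    t = abs_sorted[k]  # |y|_(k+1) in the paper's one-based notation
    return [soft_threshold(v, t) for v in y]


def divergence(y, k, h=1e-6):
    """Central finite-difference approximation to sum_i d mu_hat_{k,i} / d y_i."""
    total = 0.0
    for i in range(len(y)):
        up = list(y)
        dn = list(y)
        up[i] += h
        dn[i] -= h
        total += (lars_orthogonal(up, k)[i] - lars_orthogonal(dn, k)[i]) / (2 * h)
    return total


y = [3.0, -1.5, 0.2, 2.4, -0.7]  # made-up data, no ties in |y_i|
for k in range(len(y) + 1):
    print(k, lars_orthogonal(y, k), round(divergence(y, k), 6))
```

For each $k$ the printed divergence matches $k$: the $k$ coordinates above the threshold each contribute derivative 1, while the coordinate that sets the threshold and those below it contribute 0.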

4.2. The divergence formula. While for the most general design matrices
$X$, it can happen that $\hat\mu_k$ fails to be almost differentiable, we will see that
the divergence formula

(4.19)  $\nabla \cdot \hat\mu_k(y) = k$

does hold almost everywhere. Indeed, certain authors [e.g., Meyer and Woodroofe
(2000)] have argued that the divergence $\nabla \cdot \hat\mu$ of an estimator provides itself
a useful measure of the effective dimension of a model.

Turning to LARS, we shall say that $\hat\mu(y)$ is locally linear at a data point
$y_0$ if there is some small open neighborhood of $y_0$ on which $\hat\mu(y) = My$ is
exactly linear. Of course, the matrix $M = M(y_0)$ can depend on $y_0$; in the
case of LARS, it will be seen to be constant on the interior of polygonal
regions, with jumps across the boundaries. We say that a set $G$ has full
measure if its complement has Lebesgue measure zero.

Lemma 2. There is an open set $G_k$ of full measure such that, at all
$y \in G_k$, $\hat\mu_k(y)$ is locally linear and $\nabla \cdot \hat\mu_k(y) = k$.

Proof. We give here only the part of the proof that relates to actual
calculation of the divergence in (4.19). The arguments establishing
continuity and local linearity are delayed to the Appendix.

So, let us fix a point $y$ in the interior of $G_k$. From Lemma 13 in the
Appendix, this means that near $y$ the active set $\mathcal{A}_k(y)$ is locally constant, that
a single variable enters at the next step, this variable being the same near
$y$. In addition, $\hat\mu_k(y)$ is locally linear, and hence in particular differentiable.
Since $G_k \subset G_l$ for $l < k$, the same story applies at all previous steps and we
have

(4.20)  $\hat\mu_k(y) = \sum_{l=1}^{k} \gamma_l(y) u_l.$

Differentiating the $j$th component of vector $\hat\mu_k(y)$ yields

$\frac{\partial \hat\mu_{k,j}}{\partial y_i}(y) = \sum_{l=1}^{k} \frac{\partial \gamma_l(y)}{\partial y_i} u_{l,j}.$

In particular, for the divergence

(4.21)  $\nabla \cdot \hat\mu_k(y) = \sum_{i=1}^{n} \frac{\partial \hat\mu_{k,i}}{\partial y_i} = \sum_{l=1}^{k} \langle \nabla\gamma_l, u_l \rangle,$

the brackets indicating inner product.

The active set is $\mathcal{A}_k = \{1, 2, \ldots, k\}$ and $x_{k+1}$ is the variable to enter next.
For $k \ge 2$, write $\delta_k = x_l - x_k$ for any choice $l < k$; as remarked in the
Conventions in the Appendix, the choice of $l$ is immaterial (e.g., $l = 1$ for
definiteness). Let $b_{k+1} = \langle \delta_{k+1}, u_k \rangle$, which is nonzero, as argued in the proof
of Lemma 13. As shown in (A.4) in the Appendix, (2.13) can be rewritten

(4.22)  $\gamma_k(y) = b_{k+1}^{-1} \langle \delta_{k+1}, y - \hat\mu_{k-1} \rangle.$

For $k \ge 2$, define the linear space of vectors equiangular with the active set

$\mathcal{L}_k = \mathcal{L}_k(y) = \{ u : \langle x_1, u \rangle = \cdots = \langle x_k, u \rangle \text{ for } x_l \text{ with } l \in \mathcal{A}_k(y) \}.$

[We may drop the dependence on $y$ since $\mathcal{A}_k(y)$ is locally fixed.] Clearly
$\dim \mathcal{L}_k = n - k + 1$ and

(4.23)  $u_k \in \mathcal{L}_k, \qquad \mathcal{L}_{k+1} \subset \mathcal{L}_k.$

We shall now verify that, for each $k \ge 1$,

(4.24)  $\langle \nabla\gamma_k, u_k \rangle = 1$ and $\langle \nabla\gamma_k, u \rangle = 0$ for $u \in \mathcal{L}_{k+1}$.

Formula (4.21) shows that this suffices to prove Lemma 2.

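First, for $k = 1$ we have $\gamma_1(y) = b_2^{-1} \langle \delta_2, y \rangle$ and $\langle \nabla\gamma_1, u \rangle = b_2^{-1} \langle \delta_2, u \rangle$, and
that

$\langle \delta_2, u \rangle = \langle x_1 - x_2, u \rangle = \begin{cases} b_2, & \text{if } u = u_1, \\ 0, & \text{if } u \in \mathcal{L}_2. \end{cases}$

Now, for general $k$, combine (4.22) and (4.20):

$b_{k+1} \gamma_k(y) = \langle \delta_{k+1}, y \rangle - \sum_{l=1}^{k-1} \langle \delta_{k+1}, u_l \rangle \gamma_l(y),$

and hence

$b_{k+1} \langle \nabla\gamma_k, u \rangle = \langle \delta_{k+1}, u \rangle - \sum_{l=1}^{k-1} \langle \delta_{k+1}, u_l \rangle \langle \nabla\gamma_l, u \rangle.$

From the definitions of $b_{k+1}$ and $\mathcal{L}_{k+1}$ we have

$\langle \delta_{k+1}, u \rangle = \langle x_l - x_{k+1}, u \rangle = \begin{cases} b_{k+1}, & \text{if } u = u_k, \\ 0, & \text{if } u \in \mathcal{L}_{k+1}. \end{cases}$

Hence the truth of (4.24) for step $k$ follows from its truth at step $k - 1$
because of the containment properties (4.23). □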



4.3. Proof of Theorem 4. To complete the proof of Theorem 4, we state 
the following regularity result, proved in the Appendix. 

Lemma 3. Under the positive cone condition, $\hat\mu_k(y)$ is continuous and
almost differentiable.

This guarantees that Stein's formula (4.12) is valid for $\hat\mu_k$ under the
positive cone condition, so the divergence formula of Lemma 2 then immediately
yields Theorem 4.

5. LARS and Lasso properties. The LARS and Lasso algorithms are 
described more carefully in this section, with an eye toward fully under- 
standing their relationship. Theorem 1 of Section 3 will be verified. The 
latter material overlaps results in Osborne, Presnell and Turlach (2000a), 
particularly in their Section 4. Our point of view here allows the Lasso to 
be described as a quite simple modification of LARS, itself a variation of 
traditional Forward Selection methodology, and in this sense should be more 
accessible to statistical audiences. In any case we will stick to the language 
of regression and correlation rather than convex optimization, though some 
of the techniques are familiar from the optimization literature. 

The results will be developed in a series of lemmas, eventually leading
to a proof of Theorem 1 and its generalizations. The first three lemmas
refer to attributes of the LARS procedure that are not specific to its Lasso
modification.

Using notation as in (2.17)-(2.20), suppose LARS has completed step
$k - 1$, giving estimate $\hat\mu_{k-1}$ and active set $\mathcal{A}_k$ for step $k$, with covariate $x_k$
the newest addition to the active set.

Lemma 4. If $x_k$ is the only addition to the active set at the end of
step $k - 1$, then the coefficient vector $w_k = A_k G_k^{-1} 1_k$ for the equiangular
vector $u_k = X_k w_k$, (2.6), has its $k$th component $w_{kk}$ agreeing in sign with
the current correlation $c_{kk} = x_k'(y - \hat\mu_{k-1})$. Moreover, the regression vector
$\hat\beta_k$ for $\hat\mu_k = X\hat\beta_k$ has its $k$th component $\hat\beta_{kk}$ agreeing in sign with $c_{kk}$.

Lemma 4 says that new variables enter the LARS active set in the
"correct" direction, a weakened version of the Lasso requirement (3.1). This will
turn out to be a crucial connection for the LARS-Lasso relationship.

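Proof of Lemma 4. The case $k = 1$ is apparent. Note that since
$X_k'(y - \hat\mu_{k-1}) = \hat C_k 1_k$, (2.20), from (2.6) we have

(5.1)  $w_k = A_k \hat C_k^{-1} [(X_k' X_k)^{-1} X_k'(y - \hat\mu_{k-1})] := A_k \hat C_k^{-1} w_k^*.$

The term in square braces is the least squares coefficient vector in the
regression of the current residual on $X_k$, and the term preceding it is positive.
Note also that

(5.2)  $X_k'(y - \bar y_{k-1}) = (0, \delta)'$ with $\delta > 0$,

since $X_{k-1}'(y - \bar y_{k-1}) = 0$ by definition (this 0 has $k - 1$ elements), and
$c_k(\gamma) = x_k'(y - \gamma u_{k-1})$ decreases more slowly in $\gamma$ than $c_j(\gamma)$ for $j \in \mathcal{A}_{k-1}$:

(5.3)  $c_k(\gamma) \begin{cases} < c_j(\gamma), & \text{for } \gamma < \hat\gamma_{k-1}, \\ = c_j(\gamma) = \hat C_k, & \text{for } \gamma = \hat\gamma_{k-1}, \\ > c_j(\gamma), & \text{for } \hat\gamma_{k-1} < \gamma < \bar\gamma_{k-1}. \end{cases}$

Thus

(5.4)  $w_k^* = (X_k' X_k)^{-1} X_k'(y - \bar y_{k-1} + \bar y_{k-1} - \hat\mu_{k-1})$

(5.5)  $\phantom{w_k^*} = (X_k' X_k)^{-1} (0, \delta)' + (X_k' X_k)^{-1} X_k' [(\bar\gamma_{k-1} - \hat\gamma_{k-1}) u_{k-1}].$

The $k$th element of $w_k^*$ is positive, because it is in the first term in (5.5)
[$(X_k' X_k)$ is positive definite], and in the second term it is 0 since
$u_{k-1} \in \mathcal{L}(X_{k-1})$.

This proves the first statement in Lemma 4. The second follows from

(5.6)  $\hat\beta_{kk} = \hat\beta_{k-1,k} + \hat\gamma_k w_{kk},$

and $\hat\beta_{k-1,k} = 0$, $x_k$ not being active before step $k$. □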

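Our second lemma interprets the quantity $A_{\mathcal{A}} = (1' G_{\mathcal{A}}^{-1} 1)^{-1/2}$, (2.4) and
(2.5). Let $\mathcal{S}_{\mathcal{A}}$ indicate the extended simplex generated by the columns of
$X_{\mathcal{A}}$,

(5.7)  $\mathcal{S}_{\mathcal{A}} = \left\{ v = \sum_{j \in \mathcal{A}} s_j x_j P_j : \sum_{j \in \mathcal{A}} P_j = 1 \right\},$

"extended" meaning that the coefficients $P_j$ are allowed to be negative.

Lemma 5. The point in $\mathcal{S}_{\mathcal{A}}$ nearest the origin is

(5.8)  $v_{\mathcal{A}} = A_{\mathcal{A}} u_{\mathcal{A}} = A_{\mathcal{A}} X_{\mathcal{A}} w_{\mathcal{A}}$ where $w_{\mathcal{A}} = A_{\mathcal{A}} G_{\mathcal{A}}^{-1} 1_{\mathcal{A}},$

with length $\|v_{\mathcal{A}}\| = A_{\mathcal{A}}$. If $\mathcal{A} \subseteq \mathcal{B}$, then $A_{\mathcal{A}} \ge A_{\mathcal{B}}$, the largest possible value
being $A_{\mathcal{A}} = 1$ for $\mathcal{A}$ a singleton.

Proof. For any $v \in \mathcal{S}_{\mathcal{A}}$, the squared distance to the origin is
$\|X_{\mathcal{A}} P\|^2 = P' G_{\mathcal{A}} P$. Introducing a Lagrange multiplier to enforce the summation
constraint, we differentiate

(5.9)  $P' G_{\mathcal{A}} P - \lambda (1_{\mathcal{A}}' P - 1),$

and find that the minimizing $P_{\mathcal{A}} = \lambda G_{\mathcal{A}}^{-1} 1_{\mathcal{A}}$. Summing, we get
$\lambda 1_{\mathcal{A}}' G_{\mathcal{A}}^{-1} 1_{\mathcal{A}} = 1$, and hence

(5.10)  $P_{\mathcal{A}} = A_{\mathcal{A}}^2 G_{\mathcal{A}}^{-1} 1_{\mathcal{A}} = A_{\mathcal{A}} w_{\mathcal{A}}.$

Hence $v_{\mathcal{A}} = X_{\mathcal{A}} P_{\mathcal{A}} \in \mathcal{S}_{\mathcal{A}}$ and

(5.11)  $\|v_{\mathcal{A}}\|^2 = P_{\mathcal{A}}' G_{\mathcal{A}} P_{\mathcal{A}} = A_{\mathcal{A}}^4 1_{\mathcal{A}}' G_{\mathcal{A}}^{-1} 1_{\mathcal{A}} = A_{\mathcal{A}}^2,$

verifying (5.8). If $\mathcal{A} \subseteq \mathcal{B}$, then $\mathcal{S}_{\mathcal{A}} \subseteq \mathcal{S}_{\mathcal{B}}$, so the nearest distance $A_{\mathcal{B}}$ must be
equal to or less than the nearest distance $A_{\mathcal{A}}$. $A_{\mathcal{A}}$ obviously equals 1 if and
only if $\mathcal{A}$ has only one member. □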
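
Lemma 5 can be visualized with two covariates, where the extended simplex is simply the line through the two (signed) covariate vectors. The sketch below is a minimal check under the assumption that both vectors have unit length and that the signs $s_j$ have already been absorbed into them; the function name `equiangular_2` and the vectors `x1`, `x2` are illustrative only.

```python
import math


def equiangular_2(x1, x2):
    """Return (A, w, u) for two unit-length covariates with signs already applied.

    With G = [[1, g], [g, 1]] we have G^{-1} 1 = (1, 1) / (1 + g), so
    A = (1' G^{-1} 1)^{-1/2}, w = A G^{-1} 1 and u = X w.  Requires g > -1.
    """
    g = sum(a * b for a, b in zip(x1, x2))
    ginv_one = 1.0 / (1.0 + g)        # common entry of G^{-1} 1
    A = (2.0 * ginv_one) ** -0.5      # A_A
    w = [A * ginv_one, A * ginv_one]  # w_A
    u = [w[0] * a + w[1] * b for a, b in zip(x1, x2)]
    return A, w, u


def norm(v):
    return math.sqrt(sum(c * c for c in v))


x1 = [1.0, 0.0]
x2 = [0.6, 0.8]
A, w, u = equiangular_2(x1, x2)
v = [A * c for c in u]  # candidate nearest point, as in (5.8)

# Distance from the origin to other points P*x1 + (1-P)*x2 on the line.
others = [norm([p * a + (1 - p) * b for a, b in zip(x1, x2)])
          for p in (-1.0, 0.0, 0.25, 0.75, 1.0, 2.0)]
print(A, norm(u), norm(v), min(others))
```

Here $u$ has unit length, $\|v\| = A_{\mathcal{A}} < 1$, and every other sampled point on the line lies farther from the origin, in line with the lemma.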

The LARS algorithm and its various modifications proceed in piecewise
linear steps. For $m$-vectors $\hat\beta$ and $d$, let

(5.12)  $\beta(\gamma) = \hat\beta + \gamma d$ and $S(\gamma) = \|y - X\beta(\gamma)\|^2.$

Lemma 6. Letting $\hat c = X'(y - X\hat\beta)$ be the current correlation vector at
$\hat\mu = X\hat\beta$,

(5.13)  $S(\gamma) - S(0) = -2 \hat c' d \gamma + d' X' X d \gamma^2.$

Proof. $S(\gamma)$ is a quadratic function of $\gamma$, with first two derivatives at
$\gamma = 0$,

(5.14)  $\dot S(0) = -2 \hat c' d$ and $\ddot S(0) = 2 d' X' X d.$ □
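
Since (5.13) is an exact quadratic identity, it can be confirmed directly on a small example. The matrix `X`, the vectors `y`, `beta`, `d` and the step `gamma` below are arbitrary made-up values, and the helper names are ours.

```python
def matvec(X, b):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, b)) for row in X]


def rss(X, y, b):
    """S = ||y - X b||^2."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, matvec(X, b)))


# Arbitrary illustrative values.
X = [[1.0, 0.5], [0.0, 1.0], [2.0, -1.0]]
y = [1.0, 2.0, 0.5]
beta = [0.3, -0.2]
d = [1.0, 0.5]
gamma = 0.7

# Current correlation vector c = X'(y - X beta).
resid = [yi - fi for yi, fi in zip(y, matvec(X, beta))]
c = [sum(X[i][j] * resid[i] for i in range(len(y))) for j in range(len(beta))]

Xd = matvec(X, d)
lhs = rss(X, y, [b + gamma * dj for b, dj in zip(beta, d)]) - rss(X, y, beta)
rhs = (-2.0 * sum(cj * dj for cj, dj in zip(c, d)) * gamma
       + sum(v * v for v in Xd) * gamma ** 2)
print(lhs, rhs)
```

The two sides agree up to floating-point rounding for any choice of `gamma`, which is all that Lemma 6 asserts.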

The remainder of this section concerns the LARS-Lasso relationship. Now
$\hat\beta = \hat\beta(t)$ will indicate a Lasso solution (1.5), and likewise $\hat\mu = \hat\mu(t) = X\hat\beta(t)$.
Because $S(\hat\beta)$ and $T(\hat\beta)$ are both convex functions of $\hat\beta$, with $S$ strictly
convex, standard results show that $\hat\beta(t)$ and $\hat\mu(t)$ are unique and continuous
functions of $t$.

For a given value of $t$ let

(5.15)  $\mathcal{A} = \{ j : \hat\beta_j(t) \ne 0 \}.$

We will show later that $\mathcal{A}$ is also the active set that determines the
equiangular direction $u_{\mathcal{A}}$, (2.6), for the LARS-Lasso computations.

We wish to characterize the track of the Lasso solutions $\hat\beta(t)$ or
equivalently of $\hat\mu(t)$ as $t$ increases from 0 to its maximum effective value. Let $\mathcal{T}$ be
an open interval of the $t$ axis, with infimum $t_0$, within which the set $\mathcal{A}$ of
nonzero Lasso coefficients $\hat\beta_j(t)$ remains constant.

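Lemma 7. The Lasso estimates $\hat\mu(t)$ satisfy

(5.16)  $\hat\mu(t) = \hat\mu(t_0) + A_{\mathcal{A}} (t - t_0) u_{\mathcal{A}}$

for $t \in \mathcal{T}$, where $u_{\mathcal{A}}$ is the equiangular vector $X_{\mathcal{A}} w_{\mathcal{A}}$, $w_{\mathcal{A}} = A_{\mathcal{A}} G_{\mathcal{A}}^{-1} 1_{\mathcal{A}}$, (2.7).

Proof. The lemma says that, for $t$ in $\mathcal{T}$, $\hat\mu(t)$ moves linearly along the
equiangular vector $u_{\mathcal{A}}$ determined by $\mathcal{A}$. We can also state this in terms of
the nonzero regression coefficients $\hat\beta_{\mathcal{A}}(t)$,

(5.17)  $\hat\beta_{\mathcal{A}}(t) = \hat\beta_{\mathcal{A}}(t_0) + S_{\mathcal{A}} A_{\mathcal{A}} (t - t_0) w_{\mathcal{A}},$

where $S_{\mathcal{A}}$ is the diagonal matrix with diagonal elements $s_j$, $j \in \mathcal{A}$. [$S_{\mathcal{A}}$ is
needed in (5.17) because definitions (2.4), (2.10) require
$\hat\mu(t) = X\hat\beta(t) = X_{\mathcal{A}} S_{\mathcal{A}} \hat\beta_{\mathcal{A}}(t)$.]

Since $\hat\beta(t)$ satisfies (1.5) and has nonzero set $\mathcal{A}$, it also minimizes

(5.18)  $S(\hat\beta_{\mathcal{A}}) = \|y - X_{\mathcal{A}} S_{\mathcal{A}} \hat\beta_{\mathcal{A}}\|^2$

subject to

(5.19)  $\sum_{\mathcal{A}} s_j \hat\beta_j = t$ and $\mathrm{sign}(\hat\beta_j) = s_j$ for $j \in \mathcal{A}$.

[The inequality in (1.5) can be replaced by $T(\hat\beta) = t$ as long as $t$ is less than
$\sum |\bar\beta_j|$ for the full $m$-variable OLS solution $\bar\beta_m$.] Moreover, the fact that the
minimizing point $\hat\beta_{\mathcal{A}}(t)$ occurs strictly inside the simplex (5.19), combined
with the strict convexity of $S(\hat\beta_{\mathcal{A}})$, implies we can drop the second condition
in (5.19) so that $\hat\beta_{\mathcal{A}}(t)$ solves

(5.20)  minimize $\{ S(\hat\beta_{\mathcal{A}}) \}$ subject to $\sum_{\mathcal{A}} s_j \hat\beta_j = t$.

Introducing a Lagrange multiplier, (5.20) becomes

(5.21)  minimize $\tfrac{1}{2} \|y - X_{\mathcal{A}} S_{\mathcal{A}} \hat\beta_{\mathcal{A}}\|^2 + \lambda \sum_{\mathcal{A}} s_j \hat\beta_j$.

Differentiating we get

(5.22)  $-S_{\mathcal{A}} X_{\mathcal{A}}' (y - X_{\mathcal{A}} S_{\mathcal{A}} \hat\beta_{\mathcal{A}}) + \lambda S_{\mathcal{A}} 1_{\mathcal{A}} = 0.$

Consider two values $t_1$ and $t_2$ in $\mathcal{T}$ with $t_0 < t_1 < t_2$. Corresponding to
each of these are values for the Lagrange multiplier $\lambda$ such that $\lambda_1 > \lambda_2$,
and solutions $\hat\beta_{\mathcal{A}}(t_1)$ and $\hat\beta_{\mathcal{A}}(t_2)$. Inserting these into (5.22), differencing
and premultiplying by $S_{\mathcal{A}}$ we get

(5.23)  $X_{\mathcal{A}}' X_{\mathcal{A}} S_{\mathcal{A}} (\hat\beta_{\mathcal{A}}(t_2) - \hat\beta_{\mathcal{A}}(t_1)) = (\lambda_1 - \lambda_2) 1_{\mathcal{A}}.$

Hence

(5.24)  $\hat\beta_{\mathcal{A}}(t_2) - \hat\beta_{\mathcal{A}}(t_1) = (\lambda_1 - \lambda_2) S_{\mathcal{A}} G_{\mathcal{A}}^{-1} 1_{\mathcal{A}}.$

However, $s_{\mathcal{A}}' [\hat\beta_{\mathcal{A}}(t_2) - \hat\beta_{\mathcal{A}}(t_1)] = t_2 - t_1$ according to the Lasso definition, so

(5.25)  $t_2 - t_1 = (\lambda_1 - \lambda_2) s_{\mathcal{A}}' S_{\mathcal{A}} G_{\mathcal{A}}^{-1} 1_{\mathcal{A}} = (\lambda_1 - \lambda_2) 1_{\mathcal{A}}' G_{\mathcal{A}}^{-1} 1_{\mathcal{A}} = (\lambda_1 - \lambda_2) A_{\mathcal{A}}^{-2}$

and

(5.26)  $\hat\beta_{\mathcal{A}}(t_2) - \hat\beta_{\mathcal{A}}(t_1) = S_{\mathcal{A}} A_{\mathcal{A}}^2 (t_2 - t_1) G_{\mathcal{A}}^{-1} 1_{\mathcal{A}} = S_{\mathcal{A}} A_{\mathcal{A}} (t_2 - t_1) w_{\mathcal{A}}.$

Letting $t_2 = t$ and $t_1 \to t_0$ gives (5.17) by the continuity of $\hat\beta(t)$, and
finally (5.16). Note that (5.16) implies that the maximum absolute correlation
$\hat C(t)$ equals $\hat C(t_0) - A_{\mathcal{A}}^2 (t - t_0)$, so that $\hat C(t)$ is a piecewise linear decreasing
function of the Lasso parameter $t$. □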

The Lasso solution $\hat\beta(t)$ occurs on the surface of the diamond-shaped
convex polytope

(5.27)  $\mathcal{D}(t) = \left\{ \beta : \sum |\beta_j| \le t \right\},$

$\mathcal{D}(t)$ increasing with $t$. Lemma 7 says that, for $t \in \mathcal{T}$, $\hat\beta(t)$ moves linearly
along edge $\mathcal{A}$ of the polytope, the edge having $\beta_j = 0$ for $j \notin \mathcal{A}$. Moreover the
regression estimates $\hat\mu(t)$ move in the LARS equiangular direction $u_{\mathcal{A}}$, (2.6).
It remains to show that "$\mathcal{A}$" changes according to the rules of Theorem 1,
which is the purpose of the next three lemmas.

Lemma 8. A Lasso solution $\hat\beta$ has

(5.28)  $\hat c_j = \hat C \cdot \mathrm{sign}(\hat\beta_j)$ for $j \in \mathcal{A}$,

where $\hat c_j$ equals the current correlation $x_j'(y - \hat\mu) = x_j'(y - X\hat\beta)$. In
particular, this implies that

(5.29)  $\mathrm{sign}(\hat\beta_j) = \mathrm{sign}(\hat c_j)$ for $j \in \mathcal{A}$.

Proof. This follows immediately from (5.22) by noting that the $j$th
element of the left-hand side is $\hat c_j$, and the right-hand side is $\lambda \cdot \mathrm{sign}(\hat\beta_j)$ for
$j \in \mathcal{A}$. Likewise $\lambda = |\hat c_j| = \hat C$. □

Lemma 9. Within an interval $\mathcal{T}$ of constant nonzero set $\mathcal{A}$, and also
at $t_0 = \inf(\mathcal{T})$, the Lasso current correlations $c_j(t) = x_j'(y - \hat\mu(t))$ satisfy

$|c_j(t)| = \hat C(t) \equiv \max\{ |c_\ell(t)| \}$ for $j \in \mathcal{A}$

and

(5.30)  $|c_j(t)| \le \hat C(t)$ for $j \notin \mathcal{A}$.

Proof. Equation (5.28) says that the $|c_j(t)|$ have identical values, say
$\hat C_t$, for $j \in \mathcal{A}$. It remains to show that $\hat C_t$ has the extremum properties
indicated in (5.30). For an $m$-vector $d$ we define $\beta(\gamma) = \hat\beta(t) + \gamma d$ and $S(\gamma)$
as in (5.12), likewise $T(\gamma) = \sum |\beta_j(\gamma)|$, and

(5.31)  $R_t(d) = -\dot S(0) / \dot T(0).$

Again assuming $\hat\beta_j > 0$ for $j \in \mathcal{A}$, by redefinition of $x_j$ if necessary, (5.14)
and (5.28) yield

(5.32)  $R_t(d) = 2 \left[ \hat C_t \sum_{\mathcal{A}} d_j + \sum_{\mathcal{A}^c} c_j(t) d_j \right] \Big/ \left[ \sum_{\mathcal{A}} d_j + \sum_{\mathcal{A}^c} |d_j| \right].$

If $d_j = 0$ for $j \notin \mathcal{A}$, and $\sum d_j \ne 0$,

(5.33)  $R_t(d) = 2 \hat C_t,$

while if $d$ has only component $j$ nonzero we can make

(5.34)  $R_t(d) = 2 |c_j(t)|.$

According to Lemma 7 the Lasso solutions for $t \in \mathcal{T}$ use $d_{\mathcal{A}}$ proportional
to $w_{\mathcal{A}}$ with $d_j = 0$ for $j \notin \mathcal{A}$, so

(5.35)  $R_t \equiv R_t(w_{\mathcal{A}})$

is the downward slope of the curve $(T, S(T))$ at $T = t$, and by the definition
of the Lasso must maximize $R_t(d)$. This shows that $\hat C_t = \hat C(t)$, and
verifies (5.30), which also holds at $t_0 = \inf(\mathcal{T})$ by the continuity of the current
correlations. □

Fig. 8. Plot of $S$ versus $T$ for Lasso applied to diabetes data; points indicate
the 12 modified LARS steps of Figure 1; triangle is $(T, S)$ boundary point at
$t = 1000$; dashed arrow is tangent at $t = 1000$, negative slope $R_t$, (5.31). The
$(T, S)$ curve is a decreasing, convex, quadratic spline.
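
Lemmas 8 and 9 are easy to see in the simplest possible setting. The sketch below assumes an orthonormal design $X = I$, for which the minimizer of the Lagrangian form $\tfrac{1}{2}\|y - \beta\|^2 + \lambda \sum |\beta_j|$ is coordinatewise soft thresholding at $\lambda$; that closed form is a standard fact rather than something derived in this section, and the data vector `y` and the value `lam` are made up.

```python
def soft(v, lam):
    """Soft thresholding: Lasso solution for X = I in Lagrangian form."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def sign(v):
    return (v > 0) - (v < 0)


y = [3.0, -1.5, 0.2, 2.4, -0.7]  # made-up data
lam = 1.0                        # assumed Lagrange multiplier

beta = [soft(v, lam) for v in y]
c = [yi - bi for yi, bi in zip(y, beta)]  # current correlations x_j'(y - X beta)
C = max(abs(cj) for cj in c)              # maximum absolute correlation

active = [j for j, b in enumerate(beta) if b != 0.0]
inactive = [j for j, b in enumerate(beta) if b == 0.0]

# (5.28): c_j = C * sign(beta_j) on the active set.
gap_active = max(abs(c[j] - C * sign(beta[j])) for j in active)
# (5.30): |c_j| <= C off the active set.
max_inactive = max(abs(c[j]) for j in inactive)
print(beta, c, C, gap_active, max_inactive)
```

In this example the active coordinates all carry absolute correlation equal to `lam`, with signs matching their coefficients, while the inactive coordinates have strictly smaller absolute correlation.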

We note that Lemmas 7-9 follow relatively easily from the Karush-Kuhn- 
Tucker conditions for optimality for the quadratic programming Lasso prob- 
lem [Osborne, Presnell and Turlach (2000a)]; we have chosen a more geo- 
metrical argument here to demonstrate the nature of the Lasso path. 

Figure 8 shows the $(T, S)$ curve corresponding to the Lasso estimates in
Figure 1. The arrow indicates the tangent to the curve at $t = 1000$, which
has downward slope $R_{1000}$. The argument above relies on the fact that $R_t(d)$
cannot be greater than $R_t$, or else there would be $(T, S)$ values lying below
the optimal curve. Using Lemmas 3 and 4 it can be shown that the $(T, S)$
curve is always convex, as in Figure 8, being a quadratic spline with
$\dot S(T) = -2 \hat C(T)$ and $\ddot S(T) = 2 A_{\mathcal{A}}^2$.

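We now consider in detail the choice of active set at a breakpoint of the
piecewise linear Lasso path. Let $t = t_0$ indicate such a point, $t_0 = \inf(\mathcal{T})$ as
in Lemma 9, with Lasso regression vector $\hat\beta$, prediction estimate $\hat\mu = X\hat\beta$,
current correlations $\hat c = X'(y - \hat\mu)$, $s_j = \mathrm{sign}(\hat c_j)$ and maximum absolute
correlation $\hat C$. Define

(5.36)  $\mathcal{A}_1 = \{ j : \hat\beta_j \ne 0 \}, \qquad \mathcal{A}_0 = \{ j : \hat\beta_j = 0 \text{ and } |\hat c_j| = \hat C \},$

$\mathcal{A}_{10} = \mathcal{A}_1 \cup \mathcal{A}_0$ and $\mathcal{A}_2 = \mathcal{A}_{10}^c$, and take $\beta(\gamma) = \hat\beta + \gamma d$ for some $m$-vector $d$;
also $S(\gamma) = \|y - X\beta(\gamma)\|^2$ and $T(\gamma) = \sum |\beta_j(\gamma)|$.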

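Lemma 10. The negative slope (5.31) at $t_0$ is bounded by $2\hat C$,

(5.37)  $R(d) = -\dot S(0) / \dot T(0) \le 2 \hat C,$

with equality only if $d_j = 0$ for $j \in \mathcal{A}_2$. If so, the differences $\Delta S = S(\gamma) - S(0)$
and $\Delta T = T(\gamma) - T(0)$ satisfy

(5.38)  $\Delta S = -2 \hat C \Delta T + L(d)^2 \cdot (\Delta T)^2,$

where

(5.39)  $L(d) = \|X d / d_+\|.$

Proof. We can assume $\hat c_j \ge 0$ for all $j$, by redefinition if necessary, so
$\hat\beta_j \ge 0$ according to Lemma 8. Proceeding as in (5.32),

(5.40)  $R(d) = 2 \hat C \left[ \sum_{\mathcal{A}_{10}} d_j + \sum_{\mathcal{A}_2} (\hat c_j / \hat C) d_j \right] \Big/ \left[ \sum_{\mathcal{A}_1} d_j + \sum_{\mathcal{A}_0 \cup \mathcal{A}_2} |d_j| \right].$

We need $d_j \ge 0$ for $j \in \mathcal{A}_0 \cup \mathcal{A}_2$ in order to maximize (5.40), in which case

(5.41)  $R(d) = 2 \hat C \left[ \sum_{\mathcal{A}_{10}} d_j + \sum_{\mathcal{A}_2} (\hat c_j / \hat C) d_j \right] \Big/ \left[ \sum_{\mathcal{A}_{10}} d_j + \sum_{\mathcal{A}_2} d_j \right].$

This is $< 2 \hat C$ unless $d_j = 0$ for $j \in \mathcal{A}_2$, verifying (5.37), and also implying

(5.42)  $T(\gamma) = T(0) + \gamma \sum_{\mathcal{A}_{10}} d_j.$

The first term on the right-hand side of (5.13) is then $-2 \hat C (\Delta T)$, while the
second term equals $(d / d_+)' X' X (d / d_+) (\Delta T)^2 = L(d)^2 (\Delta T)^2$. □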

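Lemma 10 has an important consequence. Suppose that $\mathcal{A}$ is the current
active set for the Lasso, as in (5.17), and that $\mathcal{A} \subseteq \mathcal{A}_{10}$. Then Lemma 5 says
that $L(d)$ is $\ge A_{\mathcal{A}}$, and (5.38) gives

(5.43)  $\Delta S \ge -2 \hat C \cdot \Delta T + A_{\mathcal{A}}^2 \cdot (\Delta T)^2,$

with equality if $d$ is chosen to give the equiangular vector $u_{\mathcal{A}}$, $d_{\mathcal{A}} = S_{\mathcal{A}} w_{\mathcal{A}}$,
$d_{\mathcal{A}^c} = 0$. The Lasso operates to minimize $S(T)$ so we want $\Delta S$ to be as
negative as possible. Lemma 10 says that if the support of $d$ is not confined to
$\mathcal{A}_{10}$, then $\dot S(0)$ exceeds the optimum value $-2 \hat C$; if it is confined, then
$\dot S(0) = -2 \hat C$ but $\ddot S(0)$ exceeds the minimum value $2 A_{\mathcal{A}}^2$ unless $d_{\mathcal{A}}$ is
proportional to $S_{\mathcal{A}} w_{\mathcal{A}}$ as in (5.17).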

Suppose that $\hat\beta$, a Lasso solution, exactly equals a $\hat\beta$ obtained from the
Lasso-modified LARS algorithm, henceforth called LARS-Lasso, as at
$t = 1000$ in Figures 1 and 3. We know from Lemma 7 that subsequent Lasso
estimates will follow a linear track determined by some subset $\mathcal{A}$,
$\mu(\gamma) = \hat\mu + \gamma u_{\mathcal{A}}$, and so will the LARS-Lasso estimates, but to verify Theorem 1
we need to show that "$\mathcal{A}$" is the same set in both cases.

Lemmas 4-7 put four constraints on the Lasso choice of $\mathcal{A}$. Define $\mathcal{A}_1$, $\mathcal{A}_0$
and $\mathcal{A}_{10}$ as at (5.36).

Constraint 1. $\mathcal{A}_1 \subseteq \mathcal{A}$. This follows from Lemma 7 since for
sufficiently small $\gamma$ the subsequent Lasso coefficients (5.17),

(5.44)  $\hat\beta_{\mathcal{A}}(\gamma) = \hat\beta_{\mathcal{A}} + \gamma S_{\mathcal{A}} w_{\mathcal{A}},$

will have $\hat\beta_j(\gamma) \ne 0$, $j \in \mathcal{A}_1$.

Constraint 2. $\mathcal{A} \subseteq \mathcal{A}_{10}$. Lemma 10, (5.37) shows that the Lasso choice $\hat d$
in $\beta(\gamma) = \hat\beta + \gamma \hat d$ must have its nonzero support in $\mathcal{A}_{10}$, or equivalently that
$\hat\mu(\gamma) = \hat\mu + \gamma u_{\mathcal{A}}$ must have $u_{\mathcal{A}} \in \mathcal{L}(X_{\mathcal{A}_{10}})$. (It is possible that $u_{\mathcal{A}}$ happens
to equal $u_{\mathcal{B}}$ for some $\mathcal{B} \supset \mathcal{A}_{10}$, but that does not affect the argument below.)

Constraint 3. $w_{\mathcal{A}} = A_{\mathcal{A}} G_{\mathcal{A}}^{-1} 1_{\mathcal{A}}$ cannot have $\mathrm{sign}(w_j) \ne \mathrm{sign}(\hat c_j)$ for
any coordinate $j \in \mathcal{A}_0$. If it does, then $\mathrm{sign}(\hat\beta_j(\gamma)) \ne \mathrm{sign}(\hat c_j(\gamma))$ for
sufficiently small $\gamma$, violating Lemma 8.

Constraint 4. Subject to Constraints 1-3, $\mathcal{A}$ must minimize $A_{\mathcal{A}}$. This
follows from Lemma 10 as in (5.43), and the requirement that the Lasso
curve $S(T)$ declines at the fastest possible rate.

Theorem 1 follows by induction: beginning at $\hat{\boldsymbol{\beta}}_0 = 0$, we follow the LARS-
Lasso algorithm and show that at every succeeding step it must continue
to agree with the Lasso definition (1.5). First of all, suppose that $\hat{\boldsymbol{\beta}}$, our
hypothesized Lasso and LARS-Lasso solution, has occurred strictly within
a LARS-Lasso step. Then $\mathcal{A}_0$ is empty so that Constraints 1 and 2 imply
that $\mathcal{A}$ cannot change its current value: the equivalence between Lasso and
LARS-Lasso must continue at least to the end of the step.

The one-at-a-time assumption of Theorem 1 says that at a LARS-Lasso
breakpoint, $\mathcal{A}_0$ has exactly one member, say $j_0$, so $\mathcal{A}$ must equal $\mathcal{A}_1$ or
$\mathcal{A}_{10}$. There are two cases: if $j_0$ has just been added to the set $\{|\hat{c}_j| = \hat{C}\}$,
then Lemma 4 says that $\mathrm{sign}(w_{j_0}) = \mathrm{sign}(\hat{c}_{j_0})$, so that Constraint 3 is not
violated; the other three constraints and Lemma 5 imply that the Lasso
choice $\mathcal{A} = \mathcal{A}_{10}$ agrees with the LARS-Lasso algorithm. The other case has
$j_0$ deleted from the active set as in (3.6). Now the choice $\mathcal{A} = \mathcal{A}_{10}$ is ruled
out by Constraint 3: it would keep $w_{\mathcal{A}}$ the same as in the previous LARS-
Lasso step, and we know that that was stopped in (3.6) to prevent a sign
contradiction at coordinate $j_0$. In other words, $\mathcal{A} = \mathcal{A}_1$, in accordance with
the Lasso modification of LARS. This completes the proof of Theorem 1.

A LARS-Lasso algorithm is available even if the one-at-a-time condition
does not hold, but at the expense of additional computation. Suppose, for
example, two new members $j_1$ and $j_2$ are added to the set $\{|\hat{c}_j| = \hat{C}\}$, so $\mathcal{A}_0 =
\{j_1, j_2\}$. It is possible but not certain that $\mathcal{A}_{10}$ does not violate Constraint
3, in which case $\mathcal{A} = \mathcal{A}_{10}$. However, if it does violate Constraint 3, then
both possibilities $\mathcal{A} = \mathcal{A}_1 \cup \{j_1\}$ and $\mathcal{A} = \mathcal{A}_1 \cup \{j_2\}$ must be examined to see
which one gives the smaller value of $A_{\mathcal{A}}$. Since one-at-a-time computations,
perhaps with some added $\mathbf{y}$ jitter, apply to all practical situations, the LARS
algorithm described in Section 7 is not equipped to handle many-at-a-time
problems.

6. Stagewise properties. The main goal of this section is to verify Theo-
rem 2. Doing so also gives us a chance to make a more detailed comparison of
the LARS and Stagewise procedures. Assume that $\hat{\boldsymbol{\beta}}$ is a Stagewise estimate
of the regression coefficients, for example, as indicated at $\sum |\hat{\beta}_j| = 2000$ in
the right panel of Figure 1, with prediction vector $\hat{\boldsymbol{\mu}} = X\hat{\boldsymbol{\beta}}$, current correla-
tions $\hat{\mathbf{c}} = X'(\mathbf{y} - \hat{\boldsymbol{\mu}})$, $\hat{C} = \max\{|\hat{c}_j|\}$ and maximal set $\mathcal{A} = \{j : |\hat{c}_j| = \hat{C}\}$. We
must show that successive Stagewise estimates of $\boldsymbol{\beta}$ develop according to the
modified LARS algorithm of Theorem 2, henceforth called LARS-Stagewise.
For convenience we can assume, by redefinition of $\mathbf{x}_j$ as $-\mathbf{x}_j$, if necessary,
that the signs $s_j = \mathrm{sign}(\hat{c}_j)$ are all non-negative.

As in (3.8)-(3.10) we suppose that the Stagewise procedure (1.7) has
taken $N$ additional $\varepsilon$-steps forward from $\hat{\boldsymbol{\mu}} = X\hat{\boldsymbol{\beta}}$, giving new prediction
vector $\hat{\boldsymbol{\mu}}(N)$.

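Lemma 11. For sufficiently small $\varepsilon$, only $j \in \mathcal{A}$ can have $P_j = N_j/N >
0$.

Proof. Letting $N\varepsilon \equiv \gamma$, $\|\hat{\boldsymbol{\mu}}(N) - \hat{\boldsymbol{\mu}}\| \leq \gamma$ so that $\hat{\mathbf{c}}(N) = X'(\mathbf{y} - \hat{\boldsymbol{\mu}}(N))$
satisfies

(6.1)    $|\hat{c}_j(N) - \hat{c}_j| = |\mathbf{x}_j'(\hat{\boldsymbol{\mu}}(N) - \hat{\boldsymbol{\mu}})| \leq \|\mathbf{x}_j\| \cdot \|\hat{\boldsymbol{\mu}}(N) - \hat{\boldsymbol{\mu}}\| \leq \gamma.$

For $\gamma < \frac{1}{2}[\hat{C} - \max_{\mathcal{A}^c}\{\hat{c}_j\}]$, $j$ in $\mathcal{A}^c$ cannot have maximal current correlation
and can never be involved in the $N$ steps. □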
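Lemma 11 can be illustrated numerically. The following is a minimal sketch, assuming NumPy, standardized columns with unit norm, and simulated data; the function name `stagewise_counts` and the data are hypothetical and not part of the paper's software. It takes $N$ steps of size $\varepsilon$ with total length $\gamma = N\varepsilon$ kept below half the gap between the largest and second-largest absolute correlations, and records how many steps each variable receives.

```python
import numpy as np

def stagewise_counts(X, y, mu, eps, n_steps):
    """Take n_steps epsilon-steps of forward Stagewise starting from prediction mu.

    Returns the new prediction vector and the step counts N_j per variable.
    """
    mu = mu.copy()
    counts = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_steps):
        c = X.T @ (y - mu)                 # current correlations
        j = int(np.argmax(np.abs(c)))      # most correlated variable
        mu += eps * np.sign(c[j]) * X[:, j]
        counts[j] += 1
    return mu, counts

# Simulated, standardized data (an assumption made for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)             # unit-norm columns, as in the paper
y = rng.standard_normal(50)
y -= y.mean()

c0 = X.T @ y
order = np.argsort(-np.abs(c0))
gap = np.abs(c0[order[0]]) - np.abs(c0[order[1]])
gamma = 0.4 * gap                          # total step length, below gap / 2
n_steps = 200
mu1, counts = stagewise_counts(X, y, np.zeros(50), gamma / n_steps, n_steps)
```

Because each correlation moves by at most $\gamma$, every step in this run goes to the single variable in the initial maximal set, so `counts` is zero elsewhere.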

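Lemma 11 says that we can write the developing Stagewise prediction
vector as

(6.2)    $\hat{\boldsymbol{\mu}}(\gamma) = \hat{\boldsymbol{\mu}} + \gamma \mathbf{v}$, where $\mathbf{v} = X_{\mathcal{A}} P_{\mathcal{A}}$,

$P_{\mathcal{A}}$ a vector of length $|\mathcal{A}|$, with components $N_j/N$ for $j \in \mathcal{A}$. The nature of
the Stagewise procedure puts three constraints on $\mathbf{v}$, the most obvious of
which is the following.

Constraint I. The vector $\mathbf{v} \in \mathcal{S}_{\mathcal{A}}^{+}$, the nonnegative simplex

(6.3)    $\mathcal{S}_{\mathcal{A}}^{+} = \Big\{ \mathbf{v} : \mathbf{v} = \sum_{j \in \mathcal{A}} \mathbf{x}_j P_j,\ P_j \geq 0,\ \sum_{j \in \mathcal{A}} P_j = 1 \Big\}.$

Equivalently, $\gamma \mathbf{v} \in \mathcal{C}_{\mathcal{A}}$, the convex cone (3.12).

The Stagewise procedure, unlike LARS, is not required to use all of the
maximal set $\mathcal{A}$ as the active set, and can instead restrict the nonzero co-
ordinates $P_j$ to a subset $\mathcal{B} \subseteq \mathcal{A}$. Then $\mathbf{v} \in \mathcal{L}(X_{\mathcal{B}})$, the linear space spanned
by the columns of $X_{\mathcal{B}}$, but not all such vectors $\mathbf{v}$ are allowable Stagewise
forward directions.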

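Constraint II. The vector $\mathbf{v}$ must be proportional to the equiangular
vector $\mathbf{u}_{\mathcal{B}}$, (2.6), that is, $\mathbf{v} = \mathbf{v}_{\mathcal{B}}$, (5.8),

(6.4)    $\mathbf{v}_{\mathcal{B}} = A_{\mathcal{B}}^2 X_{\mathcal{B}} G_{\mathcal{B}}^{-1} \mathbf{1}_{\mathcal{B}} = A_{\mathcal{B}} \mathbf{u}_{\mathcal{B}}.$

Constraint II amounts to requiring that the current correlations in $\mathcal{B}$
decline at an equal rate: since

(6.5)    $\hat{c}_j(\gamma) = \mathbf{x}_j'(\mathbf{y} - \hat{\boldsymbol{\mu}} - \gamma \mathbf{v}) = \hat{c}_j - \gamma \mathbf{x}_j' \mathbf{v},$

we need $X_{\mathcal{B}}' \mathbf{v} = \lambda \mathbf{1}_{\mathcal{B}}$ for some $\lambda > 0$, implying $\mathbf{v} = \lambda X_{\mathcal{B}} G_{\mathcal{B}}^{-1} \mathbf{1}_{\mathcal{B}}$; choosing $\lambda = A_{\mathcal{B}}^2$
satisfies Constraint II. Violating Constraint II makes the current correlations
$\hat{c}_j(\gamma)$ unequal so that the Stagewise algorithm as defined at (1.7) could not
proceed in direction $\mathbf{v}$.

Equation (6.4) gives $X_{\mathcal{B}}' \mathbf{v}_{\mathcal{B}} = A_{\mathcal{B}}^2 \mathbf{1}_{\mathcal{B}}$, or

(6.6)    $\mathbf{x}_j' \mathbf{v}_{\mathcal{B}} = A_{\mathcal{B}}^2$ for $j \in \mathcal{B}$.

Constraint III. The vector $\mathbf{v} = \mathbf{v}_{\mathcal{B}}$ must satisfy

(6.7)    $\mathbf{x}_j' \mathbf{v}_{\mathcal{B}} \geq A_{\mathcal{B}}^2$ for $j \in \mathcal{A} - \mathcal{B}$.

Constraint III follows from (6.5). It says that the current correlations for
members of $\mathcal{A} = \{j : |\hat{c}_j| = \hat{C}\}$ not in $\mathcal{B}$ must decline at least as quickly as
those in $\mathcal{B}$. If this were not true, then $\mathbf{v}_{\mathcal{B}}$ would not be an allowable direc-
tion for Stagewise development since variables in $\mathcal{A} - \mathcal{B}$ would immediately
reenter (1.7).

To obtain strict inequality in (6.7), let $\mathcal{B}_0 \subset \mathcal{A} - \mathcal{B}$ be the set of indices for
which $\mathbf{x}_j' \mathbf{v}_{\mathcal{B}} = A_{\mathcal{B}}^2$. It is easy to show that $\mathbf{v}_{\mathcal{B} \cup \mathcal{B}_0} = \mathbf{v}_{\mathcal{B}}$. In other words, if we
take $\mathcal{B}$ to be the largest set having a given $\mathbf{v}_{\mathcal{B}}$ proportional to its equiangular
vector, then $\mathbf{x}_j' \mathbf{v}_{\mathcal{B}} > A_{\mathcal{B}}^2$ for $j \in \mathcal{A} - \mathcal{B}$.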

Writing $\hat{\boldsymbol{\mu}}(\gamma) = \hat{\boldsymbol{\mu}} + \gamma \mathbf{v}$ as in (6.2) presupposes that the Stagewise solutions
follow a piecewise linear track. However, the presupposition can be reduced
to one of piecewise differentiability by taking $\gamma$ infinitesimally small. We
can always express the family of Stagewise solutions as $\hat{\boldsymbol{\beta}}(z)$, where the
real-valued parameter $Z$ plays the role of $T$ for the Lasso, increasing from
0 to some maximum value as $\hat{\boldsymbol{\beta}}(z)$ goes from $\mathbf{0}$ to the full OLS estimate.
[The choice $Z = T$ used in Figure 1 may not necessarily yield a one-to-one
mapping; $Z = S(\mathbf{0}) - S(\hat{\boldsymbol{\beta}})$, the reduction in residual squared error, always
does.] We suppose that the Stagewise estimate $\hat{\boldsymbol{\beta}}(z)$ is everywhere right
differentiable with respect to $z$. Then the right derivative

(6.8)    $\hat{\mathbf{v}} = d\hat{\boldsymbol{\beta}}(z)/dz$

must obey the three constraints.

The definition of the idealized Stagewise procedure in Section 3.2, in which
$\varepsilon \to 0$ in rule (1.7), is somewhat vague but the three constraints apply to any
reasonable interpretation. It turns out that the LARS-Stagewise algorithm
satisfies the constraints and is unique in doing so. This is the meaning of
Theorem 2. [Of course the LARS-Stagewise algorithm is also supported by
direct numerical comparisons with (1.7), as in Figure 1's right panel.]

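If $\mathbf{u}_{\mathcal{A}} \in \mathcal{C}_{\mathcal{A}}$ then $\mathbf{v} = \mathbf{v}_{\mathcal{A}}$ obviously satisfies the three constraints. The
interesting situation for Theorem 2 is $\mathbf{u}_{\mathcal{A}} \notin \mathcal{C}_{\mathcal{A}}$, which we now assume to
be the case. Any subset $\mathcal{B} \subset \mathcal{A}$ determines a face of the convex cone of
dimension $|\mathcal{B}|$, the face having $P_j > 0$ in (3.12) for $j \in \mathcal{B}$ and $P_j = 0$ for
$j \in \mathcal{A} - \mathcal{B}$. The orthogonal projection of $\mathbf{u}_{\mathcal{A}}$ into the linear subspace $\mathcal{L}(X_{\mathcal{B}})$,
say $\mathrm{Proj}_{\mathcal{B}}(\mathbf{u}_{\mathcal{A}})$, is proportional to $\mathcal{B}$'s equiangular vector $\mathbf{u}_{\mathcal{B}}$: using (2.7),

(6.9)    $\mathrm{Proj}_{\mathcal{B}}(\mathbf{u}_{\mathcal{A}}) = X_{\mathcal{B}} G_{\mathcal{B}}^{-1} X_{\mathcal{B}}' \mathbf{u}_{\mathcal{A}} = X_{\mathcal{B}} G_{\mathcal{B}}^{-1} A_{\mathcal{A}} \mathbf{1}_{\mathcal{B}} = (A_{\mathcal{A}}/A_{\mathcal{B}}) \cdot \mathbf{u}_{\mathcal{B}},$

or equivalently

(6.10)    $\mathrm{Proj}_{\mathcal{B}}(\mathbf{v}_{\mathcal{A}}) = (A_{\mathcal{A}}/A_{\mathcal{B}})^2 \mathbf{v}_{\mathcal{B}}.$

The nearest point to $\mathbf{u}_{\mathcal{A}}$ in $\mathcal{C}_{\mathcal{A}}$, say $\hat{\mathbf{u}}_{\mathcal{A}}$, is of the form $\sum_{j \in \mathcal{A}} \mathbf{x}_j \hat{P}_j$ with $\hat{P}_j \geq 0$.
Therefore $\hat{\mathbf{u}}_{\mathcal{A}}$ exists strictly within face $\hat{\mathcal{B}}$, where $\hat{\mathcal{B}} = \{j : \hat{P}_j > 0\}$, and must
equal $\mathrm{Proj}_{\hat{\mathcal{B}}}(\mathbf{u}_{\mathcal{A}})$. According to (6.9), $\hat{\mathbf{u}}_{\mathcal{A}}$ is proportional to $\hat{\mathcal{B}}$'s equiangular
vector $\mathbf{u}_{\hat{\mathcal{B}}}$, and also to $\mathbf{v}_{\hat{\mathcal{B}}} = A_{\hat{\mathcal{B}}} \mathbf{u}_{\hat{\mathcal{B}}}$. In other words $\mathbf{v}_{\hat{\mathcal{B}}}$ satisfies Constraint II,
and it obviously also satisfies Constraint I. Figure 9 schematically illustrates
the geometry.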

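Lemma 12. The vector $\mathbf{v}_{\hat{\mathcal{B}}}$ satisfies Constraints I-III, and conversely if
$\mathbf{v}$ satisfies the three constraints, then $\mathbf{v} = \mathbf{v}_{\hat{\mathcal{B}}}$.

Proof. Let $\mathrm{Cos} \equiv A_{\mathcal{A}}/A_{\mathcal{B}}$ and $\mathrm{Sin} = [1 - \mathrm{Cos}^2]^{1/2}$, the latter being
greater than zero by Lemma 5. For any face $\mathcal{B} \subset \mathcal{A}$, (6.9) implies

(6.11)    $\mathbf{u}_{\mathcal{A}} = \mathrm{Cos} \cdot \mathbf{u}_{\mathcal{B}} + \mathrm{Sin} \cdot \mathbf{z}_{\mathcal{B}},$

where $\mathbf{z}_{\mathcal{B}}$ is a unit vector orthogonal to $\mathcal{L}(X_{\mathcal{B}})$, pointing away from $\mathcal{C}_{\mathcal{A}}$. By an
$n$-dimensional coordinate rotation we can make $\mathcal{L}(X_{\mathcal{B}}) = \mathcal{L}(\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_J)$,
$J = |\mathcal{B}|$, the space of $n$-vectors with last $n - J$ coordinates zero, and also

(6.12)    $\mathbf{u}_{\mathcal{B}} = (1, \mathbf{0}, 0, \mathbf{0}), \qquad \mathbf{u}_{\mathcal{A}} = (\mathrm{Cos}, \mathbf{0}, \mathrm{Sin}, \mathbf{0}),$

the first $\mathbf{0}$ having length $J - 1$, the second $\mathbf{0}$ length $n - J - 1$. Then we can
write

(6.13)    $\mathbf{x}_j = (A_{\mathcal{B}}, \mathbf{x}_{j2}, 0, \mathbf{0})$ for $j \in \mathcal{B}$,

the first coordinate $A_{\mathcal{B}}$ being required since $\mathbf{x}_j' \mathbf{u}_{\mathcal{B}} = A_{\mathcal{B}}$, (2.7). Notice that
$\mathbf{x}_j' \mathbf{u}_{\mathcal{A}} = \mathrm{Cos} \cdot A_{\mathcal{B}} = A_{\mathcal{A}}$, as also required by (2.7).

Fig. 9. The geometry of the LARS-Stagewise modification.

For $\ell \in \mathcal{A} - \mathcal{B}$ denote $\mathbf{x}_\ell$ as

(6.14)    $\mathbf{x}_\ell = (x_{\ell 1}, \mathbf{x}_{\ell 2}, x_{\ell 3}, \mathbf{x}_{\ell 4}),$

so (2.7) yields

(6.15)    $A_{\mathcal{A}} = \mathbf{x}_\ell' \mathbf{u}_{\mathcal{A}} = \mathrm{Cos} \cdot x_{\ell 1} + \mathrm{Sin} \cdot x_{\ell 3}.$

Now assume $\mathcal{B} = \hat{\mathcal{B}}$. In this case a separating hyperplane $\mathcal{H}$ orthogonal to
$\mathbf{z}_{\hat{\mathcal{B}}}$ in (6.11) passes between the convex cone $\mathcal{C}_{\mathcal{A}}$ and $\mathbf{u}_{\mathcal{A}}$, through $\hat{\mathbf{u}}_{\mathcal{A}} =
\mathrm{Cos} \cdot \mathbf{u}_{\hat{\mathcal{B}}}$, implying $x_{\ell 3} \leq 0$ [i.e., $\mathbf{x}_\ell$ and $\mathbf{u}_{\mathcal{A}}$ are on opposite sides of $\mathcal{H}$, $x_{\ell 3}$
being negative since the corresponding coordinate of $\mathbf{u}_{\mathcal{A}}$, "Sin" in (6.12), is
positive]. Equation (6.15) gives $\mathrm{Cos} \cdot x_{\ell 1} \geq A_{\mathcal{A}} = \mathrm{Cos} \cdot A_{\hat{\mathcal{B}}}$ or

(6.16)    $\mathbf{x}_\ell' \mathbf{v}_{\hat{\mathcal{B}}} = \mathbf{x}_\ell'(A_{\hat{\mathcal{B}}} \mathbf{u}_{\hat{\mathcal{B}}}) = A_{\hat{\mathcal{B}}} x_{\ell 1} \geq A_{\hat{\mathcal{B}}}^2,$

verifying that Constraint III is satisfied.

Conversely suppose that $\mathbf{v}$ satisfies Constraints I-III so that $\mathbf{v} \in \mathcal{S}_{\mathcal{A}}^{+}$ and
$\mathbf{v} = \mathbf{v}_{\mathcal{B}}$ for the nonzero coefficient set $\mathcal{B}$: $\mathbf{v}_{\mathcal{B}} = \sum_{\mathcal{B}} \mathbf{x}_j P_j$, $P_j > 0$. Let $\mathcal{H}$ be
the hyperplane passing through $\mathrm{Cos} \cdot \mathbf{u}_{\mathcal{B}}$ orthogonally to $\mathbf{z}_{\mathcal{B}}$, (6.9), (6.11). If
$\mathbf{v}_{\mathcal{B}} \neq \mathbf{v}_{\hat{\mathcal{B}}}$, then at least one of the vectors $\mathbf{x}_\ell$, $\ell \in \mathcal{A} - \mathcal{B}$, must lie on the same
side of $\mathcal{H}$ as $\mathbf{u}_{\mathcal{A}}$, so that $x_{\ell 3} > 0$ (or else $\mathcal{H}$ would be a separating hyperplane
between $\mathbf{u}_{\mathcal{A}}$ and $\mathcal{C}_{\mathcal{A}}$, and $\mathbf{v}_{\mathcal{B}}$ would be proportional to $\hat{\mathbf{u}}_{\mathcal{A}}$, the nearest point
to $\mathbf{u}_{\mathcal{A}}$ in $\mathcal{C}_{\mathcal{A}}$, implying $\mathbf{v}_{\mathcal{B}} = \mathbf{v}_{\hat{\mathcal{B}}}$). Now (6.15) gives $\mathrm{Cos} \cdot x_{\ell 1} < A_{\mathcal{A}} = \mathrm{Cos} \cdot A_{\mathcal{B}}$,
or

(6.17)    $\mathbf{x}_\ell' \mathbf{v}_{\mathcal{B}} = \mathbf{x}_\ell'(A_{\mathcal{B}} \mathbf{u}_{\mathcal{B}}) = A_{\mathcal{B}} x_{\ell 1} < A_{\mathcal{B}}^2.$

This violates Constraint III, showing that $\mathbf{v}$ must equal $\mathbf{v}_{\hat{\mathcal{B}}}$. □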
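For a small maximal set, the content of Lemma 12 can be checked by brute force. The sketch below, assuming NumPy, enumerates every subset $\mathcal{B}$ of the columns, forms $\mathbf{v}_{\mathcal{B}}$ as in (6.4) (so Constraint II holds by construction), and keeps the subsets that also satisfy Constraints I and III. The function names and the three-column example are hypothetical illustrations, not the paper's implementation; enumeration is exponential in $|\mathcal{A}|$ and is meant only as a check.

```python
from itertools import combinations
import numpy as np

def stagewise_direction(XB):
    """Return (A_B^2, P, v_B) for the columns of XB, following (6.4)."""
    G = XB.T @ XB
    g = np.linalg.solve(G, np.ones(XB.shape[1]))   # G_B^{-1} 1_B
    A2 = 1.0 / g.sum()                             # A_B^2 = (1' G_B^{-1} 1)^{-1}
    P = A2 * g                                     # coefficients of v_B; they sum to 1
    return A2, P, XB @ P

def allowable_subsets(XA, tol=1e-10):
    """List the subsets B whose v_B satisfies Constraints I and III."""
    m = XA.shape[1]
    found = []
    for size in range(1, m + 1):
        for B in combinations(range(m), size):
            A2, P, v = stagewise_direction(XA[:, list(B)])
            rest = np.array([j for j in range(m) if j not in B], dtype=int)
            ok_I = np.all(P > tol)                           # Constraint I
            ok_III = np.all(XA[:, rest].T @ v >= A2 - tol)   # Constraint III
            if ok_I and ok_III:
                found.append(B)
    return found

# Hypothetical example: the third unit vector lies "between" the first two,
# so the full equiangular vector u_A falls outside the convex cone.
a = 0.6
XA = np.array([[1.0, 0.0, a],
               [0.0, 1.0, a],
               [0.0, 0.0, np.sqrt(1.0 - 2.0 * a * a)]])
subsets = allowable_subsets(XA)   # [(0, 1)]
```

In this example only the first two columns survive, in agreement with the uniqueness asserted by the lemma.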

Notice that the direction of advance $\hat{\mathbf{v}} = \mathbf{v}_{\hat{\mathcal{B}}}$ of the idealized Stagewise
procedure is a function only of the current maximal set $\hat{\mathcal{A}} = \{j : |\hat{c}_j| = \hat{C}\}$,
say $\hat{\mathbf{v}} = \phi(\hat{\mathcal{A}})$. In the language of (6.7),

(6.18)    $\dfrac{d\hat{\boldsymbol{\beta}}(z)}{dz} = \phi(\hat{\mathcal{A}}).$

The LARS-Stagewise algorithm of Theorem 2 produces an evolving fam-
ily of estimates $\hat{\boldsymbol{\beta}}$ that everywhere satisfies (6.18). This is true at every
LARS-Stagewise breakpoint by the definition of the Stagewise modifica-
tion. It is also true between breakpoints. Let $\hat{\mathcal{A}}$ be the maximal set at the
breakpoint, giving $\hat{\mathbf{v}} = \mathbf{v}_{\hat{\mathcal{B}}} = \phi(\hat{\mathcal{A}})$. In the succeeding LARS-Stagewise inter-
val $\hat{\boldsymbol{\mu}}(\gamma) = \hat{\boldsymbol{\mu}} + \gamma \mathbf{v}_{\hat{\mathcal{B}}}$, the maximal set is immediately reduced to $\hat{\mathcal{B}}$, according
to properties (6.6), (6.7) of $\mathbf{v}_{\hat{\mathcal{B}}}$, at which it stays during the entire interval.
However, $\phi(\hat{\mathcal{B}}) = \phi(\hat{\mathcal{A}}) = \mathbf{v}_{\hat{\mathcal{B}}}$ since $\mathbf{v}_{\hat{\mathcal{B}}} \in \mathcal{C}_{\hat{\mathcal{B}}}$, so the LARS-Stagewise proce-
dure, which continues in the direction $\hat{\mathbf{v}}$ until a new member is added to the
active set, continues to obey the idealized Stagewise equation (6.18).

All of this shows that the LARS-Stagewise algorithm produces a legit-
imate version of the idealized Stagewise track. The converse of Lemma 12
says that there are no other versions, verifying Theorem 2.

The Stagewise procedure has its potential generality as an advantage over 
LARS and Lasso: it is easy to define forward Stagewise methods for a wide 
variety of nonlinear fitting problems, as in Hastie, Tibshirani and Friedman 
[(2001), Chapter 10, which begins with a Stagewise analysis of "boosting"]. 
Comparisons with LARS and Lasso within the linear model framework, as 
at the end of Section 3.2, help us better understand Stagewise methodology. 
This section's results permit further comparisons. 

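Consider proceeding forward from $\hat{\boldsymbol{\mu}}$ along unit vector $\mathbf{u}$, $\hat{\boldsymbol{\mu}}(\gamma) = \hat{\boldsymbol{\mu}} + \gamma \mathbf{u}$,
two interesting choices being the LARS direction $\mathbf{u}_{\hat{\mathcal{A}}}$ and the Stagewise
direction $\mathbf{u}_{\hat{\mathcal{B}}}$. For $\mathbf{u} \in \mathcal{L}(X_{\hat{\mathcal{A}}})$, the rate of change of $S(\gamma) = \|\mathbf{y} - \hat{\boldsymbol{\mu}}(\gamma)\|^2$ is

(6.19)    $-\dfrac{\partial S(\gamma)}{\partial \gamma}\Big|_{0} = 2\hat{C} \cdot \dfrac{\mathbf{u}_{\hat{\mathcal{A}}}' \mathbf{u}}{A_{\hat{\mathcal{A}}}},$

(6.19) following quickly from (5.14). This shows that the LARS direction
$\mathbf{u}_{\hat{\mathcal{A}}}$ maximizes the instantaneous decrease in $S$. The ratio

(6.20)    $\dfrac{\partial S_{\mathrm{Stage}}(\gamma)}{\partial \gamma}\Big|_{0} \Big/ \dfrac{\partial S_{\mathrm{LARS}}(\gamma)}{\partial \gamma}\Big|_{0} = \dfrac{A_{\hat{\mathcal{A}}}}{A_{\hat{\mathcal{B}}}},$

equaling the quantity "Cos" in (6.15).

The comparison goes the other way for the maximum absolute correlation
$\hat{C}(\gamma)$. Proceeding as in (2.15),

(6.21)    $-\dfrac{\partial \hat{C}(\gamma)}{\partial \gamma}\Big|_{0} = \min_{\hat{\mathcal{A}}} \{|\mathbf{x}_j' \mathbf{u}|\}.$

The argument for Lemma 12, using Constraints II and III, shows that $\mathbf{u}_{\hat{\mathcal{B}}}$
maximizes (6.21) at $A_{\hat{\mathcal{B}}}$, and that

(6.22)    $\dfrac{\partial \hat{C}_{\mathrm{LARS}}(\gamma)}{\partial \gamma}\Big|_{0} \Big/ \dfrac{\partial \hat{C}_{\mathrm{Stage}}(\gamma)}{\partial \gamma}\Big|_{0} = \dfrac{A_{\hat{\mathcal{A}}}}{A_{\hat{\mathcal{B}}}}.$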

The original motivation for the Stagewise procedure was to minimize 
residual squared error within a framework of parsimonious forward search. 
However, (6.20) shows that Stagewise is less greedy than LARS in this re- 
gard, it being more accurate to describe Stagewise as striving to minimize 
the maximum absolute residual correlation. 
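The rate of decrease of $S$ along a unit vector $\mathbf{u}$ in the span of the maximal set can be worked out directly. The short derivation below is a sketch under the assumptions already made in this section: signs are arranged so that $X_{\mathcal{A}}'(\mathbf{y} - \hat{\boldsymbol{\mu}}) = \hat{C}\mathbf{1}_{\mathcal{A}}$, and $\mathbf{u}_{\mathcal{A}} = A_{\mathcal{A}} X_{\mathcal{A}} G_{\mathcal{A}}^{-1}\mathbf{1}_{\mathcal{A}}$ as in (2.6). The symbol $P_{\mathcal{A}}$ for the projection onto $\mathcal{L}(X_{\mathcal{A}})$ is introduced here only for the derivation.

```latex
\begin{align*}
-\frac{\partial S(\gamma)}{\partial\gamma}\Big|_{0}
  &= 2\,\mathbf{u}'(\mathbf{y}-\hat{\boldsymbol{\mu}})
   = 2\,\mathbf{u}' P_{\mathcal{A}}(\mathbf{y}-\hat{\boldsymbol{\mu}})
   && \text{since } \mathbf{u}\in\mathcal{L}(X_{\mathcal{A}}) \\
  &= 2\,\mathbf{u}' X_{\mathcal{A}} G_{\mathcal{A}}^{-1} X_{\mathcal{A}}'(\mathbf{y}-\hat{\boldsymbol{\mu}})
   = 2\hat{C}\,\mathbf{u}' X_{\mathcal{A}} G_{\mathcal{A}}^{-1}\mathbf{1}_{\mathcal{A}}
   = 2\hat{C}\,\frac{\mathbf{u}_{\mathcal{A}}'\mathbf{u}}{A_{\mathcal{A}}} .
\end{align*}
```

Taking $\mathbf{u} = \mathbf{u}_{\mathcal{A}}$ gives the rate $2\hat{C}/A_{\mathcal{A}}$, while $\mathbf{u} = \mathbf{u}_{\hat{\mathcal{B}}}$ gives $2\hat{C}\,\mathrm{Cos}/A_{\mathcal{A}}$, because $\mathbf{u}_{\mathcal{A}}'\mathbf{u}_{\hat{\mathcal{B}}} = \mathrm{Cos}$ by (6.11); their ratio is the factor $A_{\mathcal{A}}/A_{\hat{\mathcal{B}}}$.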

7. Computations. The entire sequence of steps in the LARS algorithm
with $m < n$ variables requires $O(m^3 + nm^2)$ computations, the cost of a
least squares fit on $m$ variables.

In detail, at the $k$th of $m$ steps, we compute $m - k$ inner products $c_{jk}$
of the nonactive $\mathbf{x}_j$ with the current residuals to identify the next active
variable, and then invert the $k \times k$ matrix $G_k = X_k' X_k$ to find the next
LARS direction. We do this by updating the Cholesky factorization $R_{k-1}$
of $G_{k-1}$ found at the previous step [Golub and Van Loan (1983)]. At the
final step $m$, we have computed the Cholesky $R = R_m$ for the full cross-
product matrix, which is the dominant calculation for a least squares fit.
Hence the LARS sequence can be seen as a Cholesky factorization with a
guided ordering of the variables.
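The Cholesky update when one variable joins the active set can be sketched as follows. This is an illustrative NumPy version, not the S-PLUS implementation; the function name `cholesky_append` is hypothetical. It uses the standard fact that appending a column $\mathbf{x}$ to $X_k$ adds one column $(\mathbf{r}, \rho)$ to $R$, where $R'\mathbf{r} = X_k'\mathbf{x}$ and $\rho^2 = \mathbf{x}'\mathbf{x} - \mathbf{r}'\mathbf{r}$. For brevity the sketch calls a general solver; a triangular solver would be needed to achieve the $O(k^2)$ cost per update.

```python
import numpy as np

def cholesky_append(R, X_active, x_new):
    """Update the upper-triangular factor R (R.T @ R = X_active.T @ X_active)
    after appending the column x_new to X_active."""
    k = R.shape[0]
    b = X_active.T @ x_new                       # cross-products with active columns
    # R.T is lower triangular; a dedicated triangular solve would cost O(k^2).
    r = np.linalg.solve(R.T, b) if k else np.zeros(0)
    rho = np.sqrt(x_new @ x_new - r @ r)         # new diagonal entry
    R_new = np.zeros((k + 1, k + 1))
    R_new[:k, :k] = R
    R_new[:k, k] = r
    R_new[k, k] = rho
    return R_new

# Build the factor one column at a time on simulated data (illustration only).
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))
R = np.zeros((0, 0))
for k in range(X.shape[1]):
    R = cholesky_append(R, X[:, :k], X[:, k])
```

After the loop, `R` agrees with the Cholesky factor of the full cross-product matrix `X.T @ X`, computed here in the order in which the columns were appended.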

The computations can be reduced further by recognizing that the inner
products above can be updated at each iteration using the cross-product
matrix $X'X$ and the current directions. For $m \gg n$, this strategy is coun-
terproductive and is not used.

For the lasso modification, the computations are similar, except that oc-
casionally one has to drop a variable, and hence downdate $R_k$ [costing at
most $O(m^2)$ operations per downdate]. For the stagewise modification of
LARS, we need to check at each iteration that the components of $w$ are
all positive. If not, one or more variables are dropped [using the inner loop
of the NNLS algorithm described in Lawson and Hanson (1974)], again re-
quiring downdating of $R_k$. With many correlated variables, the stagewise
version can take many more steps than LARS because of frequent dropping
and adding of variables, increasing the computations by a factor up to 5 or
more in extreme cases.

The LARS algorithm (in any of the three states above) works gracefully
for the case where there are many more variables than observations: $m \gg n$.
In this case LARS terminates at the saturated least squares fit after $n - 1$
variables have entered the active set [at a cost of $O(n^3)$ operations]. (This
number is $n - 1$ rather than $n$, because the columns of $X$ have been mean
centered, and hence it has row-rank $n - 1$.) We make a few more remarks
about the $m \gg n$ case in the lasso state:

1. The LARS algorithm continues to provide Lasso solutions along the way, 
and the final solution highlights the fact that a Lasso fit can have no 
more than n — 1 (mean centered) variables with nonzero coefficients. 

2. Although the model involves no more than n— 1 variables at any time, 
the number of different variables ever to have entered the model during 
the entire sequence can be — and typically is — greater than n — 1. 

3. The model sequence, particularly near the saturated end, tends to be 
quite variable with respect to small changes in y. 

4. The estimation of $\sigma^2$ may have to depend on an auxiliary method such
as nearest neighbors (since the final model is saturated). We have not
investigated the accuracy of the simple approximation formula (4.12) for
the case $m > n$.
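The bound of $n - 1$ on the number of active variables comes from mean centering, and is easy to confirm numerically. The following sketch, assuming NumPy and simulated data (the sizes are arbitrary), centers the columns of a wide matrix and computes its rank.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 30                        # many more variables than observations
X = rng.standard_normal((n, m))
Xc = X - X.mean(axis=0)             # mean-center each column

# Every centered column is orthogonal to the vector of ones, so the
# columns span at most an (n - 1)-dimensional subspace.
rank_centered = np.linalg.matrix_rank(Xc)
```

For generic data the rank equals $n - 1$ exactly, which is why the saturated fit is reached after $n - 1$ variables.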

Documented S-PLUS implementations of LARS and associated functions
are available from www-stat.stanford.edu/~hastie/Papers/; the diabetes data
also appears there.

8. Boosting procedures. One motivation for studying the Forward Stage-
wise algorithm is its usefulness in adaptive fitting for data mining. In partic-
ular, Forward Stagewise ideas are used in "boosting," an important class of
fitting methods for data mining introduced by Freund and Schapire (1997).
These methods are one of the hottest topics in the area of machine learning,
and one of the most effective prediction methods in current use. Boosting can
use any adaptive fitting procedure as its "base learner" (model fitter): trees
are a popular choice, as implemented in CART [Breiman, Friedman, Olshen and Stone
(1984)].

Friedman, Hastie and Tibshirani (2000) and Friedman (2001) studied boost-
ing and proposed a number of procedures, the most relevant to this discus-
sion being least squares boosting. This procedure works by successive fitting
of regression trees to the current residuals. Specifically we start with the
residual $\mathbf{r} = \mathbf{y}$ and the fit $\hat{\mathbf{y}} = \mathbf{0}$. We fit a tree in $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$ to the re-
sponse $\mathbf{y}$ giving a fitted tree $\mathbf{t}_1$ (an $n$-vector of fitted values). Then we update
$\hat{\mathbf{y}}$ to $\hat{\mathbf{y}} + \varepsilon \cdot \mathbf{t}_1$, $\mathbf{r}$ to $\mathbf{y} - \hat{\mathbf{y}}$ and continue for many iterations. Here $\varepsilon$ is a small
positive constant. Empirical studies show that small values of $\varepsilon$ work better
than $\varepsilon = 1$: in fact, for prediction accuracy "the smaller the better." The
only drawback in taking very small values of $\varepsilon$ is computational slowness.
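The least squares boosting loop just described can be sketched in a few lines. This is a simplified illustration, assuming NumPy and simulated data: the base learner is a one-split regression stump rather than a full CART tree, and the names `fit_stump` and `ls_boost` are hypothetical.

```python
import numpy as np

def fit_stump(X, r):
    """Least squares regression stump for residual r: the best single split
    over all columns. Returns the n-vector of fitted values."""
    n, m = X.shape
    best_sse = np.sum((r - r.mean()) ** 2)
    best_fit = np.full(n, r.mean())                 # fallback: constant fit
    for j in range(m):
        for s in np.unique(X[:, j])[:-1]:           # thresholds leaving both sides nonempty
            left = X[:, j] <= s
            fit = np.where(left, r[left].mean(), r[~left].mean())
            sse = np.sum((r - fit) ** 2)
            if sse < best_sse:
                best_sse, best_fit = sse, fit
    return best_fit

def ls_boost(X, y, eps=0.1, n_iter=50):
    """Least squares boosting with shrinkage eps; returns the fit and the
    residual sum of squares recorded before each iteration."""
    y_hat = np.zeros(len(y))
    rss = []
    for _ in range(n_iter):
        r = y - y_hat                               # current residual
        rss.append(float(r @ r))
        y_hat = y_hat + eps * fit_stump(X, r)       # small step toward the fitted tree
    return y_hat, rss

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(40)
y_hat, rss = ls_boost(X, y, eps=0.1, n_iter=50)
```

Since each stump is a least squares fit to the current residual, a step of size $\varepsilon \in (0, 1]$ never increases the residual sum of squares, so `rss` is nonincreasing; smaller $\varepsilon$ simply requires more iterations to reach the same fit.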

A major research question has been why boosting works so well, and
specifically why is $\varepsilon$-shrinkage so important? To understand boosted trees
in the present context, we think of our predictors not as our original variables
$\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$, but instead as the set of all trees $\mathbf{t}_k$ that could be fitted to
our data. There is a strong similarity between least squares boosting and
Forward Stagewise regression as defined earlier. Fitting a tree to the current
residual is a numerical way of finding the "predictor" most correlated with
the residual. Note, however, that the greedy algorithms used in CART do
not search among all possible trees, but only a subset of them. In addition
the set of all trees, including a parametrization for the predicted values in
the terminal nodes, is infinite. Nevertheless one can define idealized versions
of least-squares boosting that look much like Forward Stagewise regression.

Hastie, Tibshirani and Friedman (2001) noted the striking similarity
between Forward Stagewise regression and the Lasso, and conjectured that
this may help explain the success of the Forward Stagewise process used in
least squares boosting. That is, in some sense least squares boosting may be
carrying out a Lasso fit on the infinite set of tree predictors. Note that direct
computation of the Lasso via the LARS procedure would not be feasible in
this setting because the number of trees is infinite and one could not compute
the optimal step length. However, Forward Stagewise regression is feasible
because it need only find the most correlated predictor among the infinite
set, which it approximates by numerical search.

In this paper we have established the connection between the Lasso and 
Forward Stagewise regression. We are now thinking about how these results 
can help to understand and improve boosting procedures. One such idea 
is a modified form of Forward Stagewise: we find the best tree as usual, 
but rather than taking a small step in only that tree, we take a small least 
squares step in all trees currently in our model. One can show that for small 
step sizes this procedure approximates LARS; its advantage is that it can 
be carried out on an infinite set of predictors such as trees. 

APPENDIX 
A.1. Local linearity and Lemma 2.

Conventions. We write $\mathbf{x}_l$ with subscript $l$ for members of the active
set $\mathcal{A}_k$. Thus $\mathbf{x}_l$ denotes the $l$th variable to enter, being an abuse of notation
for $s_l \mathbf{x}_{j(l)} = \mathrm{sgn}(\hat{c}_{j(l)}) \mathbf{x}_{j(l)}$. Expressions $\mathbf{x}_l'(\mathbf{y} - \hat{\boldsymbol{\mu}}_{k-1}(\mathbf{y})) = \hat{C}_k(\mathbf{y})$ and $\mathbf{x}_l' \mathbf{u}_k =
A_k$ clearly do not depend on which $\mathbf{x}_l \in \mathcal{A}_k$ we choose.

By writing $j \notin \mathcal{A}_k$, we intend that both $\mathbf{x}_j$ and $-\mathbf{x}_j$ are candidates for
inclusion at the next step. One could think of negative indices $-j$ corre-
sponding to "new" variables $\mathbf{x}_{-j} = -\mathbf{x}_j$.

The active set $\mathcal{A}_k(\mathbf{y})$ depends on the data $\mathbf{y}$. When $\mathcal{A}_k(\mathbf{y})$ is the same
for all $\mathbf{y}$ in a neighborhood of $\mathbf{y}_0$, we say that $\mathcal{A}_k(\mathbf{y})$ is locally fixed [at
$\mathcal{A}_k = \mathcal{A}_k(\mathbf{y}_0)$].

A function $g(\mathbf{y})$ is locally Lipschitz at $\mathbf{y}$ if for all sufficiently small vectors
$\Delta \mathbf{y}$,

(A.1)    $\|\Delta g\| = \|g(\mathbf{y} + \Delta \mathbf{y}) - g(\mathbf{y})\| \leq L \|\Delta \mathbf{y}\|.$

If the constant $L$ applies for all $\mathbf{y}$, we say that $g$ is uniformly locally Lipschitz
($L$), and the word "locally" may be dropped.

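Lemma 13. For each $k$, $0 \leq k \leq m$, there is an open set $G_k$ of full
measure on which $\mathcal{A}_k(\mathbf{y})$ and $\mathcal{A}_{k+1}(\mathbf{y})$ are locally fixed and differ by 1, and
$\hat{\boldsymbol{\mu}}_k(\mathbf{y})$ is locally linear. The sets $G_k$ are decreasing as $k$ increases.

Proof. The argument is by induction. The induction hypothesis states
that for each $\mathbf{y}_0 \in G_{k-1}$ there is a small ball $B(\mathbf{y}_0)$ on which (a) the active
sets $\mathcal{A}_{k-1}(\mathbf{y})$ and $\mathcal{A}_k(\mathbf{y})$ are fixed and equal to $\mathcal{A}_{k-1}$ and $\mathcal{A}_k$, respectively,
(b) $|\mathcal{A}_k \setminus \mathcal{A}_{k-1}| = 1$ so that the same single variable enters locally at stage
$k - 1$ and (c) $\hat{\boldsymbol{\mu}}_{k-1}(\mathbf{y}) = M\mathbf{y}$ is linear. We construct a set $G_k$ with the same
property.

Fix a point $\mathbf{y}_0$ and the corresponding ball $B(\mathbf{y}_0) \subset G_{k-1}$, on which $\mathbf{y} -
\hat{\boldsymbol{\mu}}_{k-1}(\mathbf{y}) = \mathbf{y} - M\mathbf{y} = R\mathbf{y}$, say. For indices $j_1, j_2 \notin \mathcal{A}$, let $N(j_1, j_2)$ be the set
of $\mathbf{y}$ for which there exists a $\gamma$ such that

(A.2)    $\mathbf{x}_l'(R\mathbf{y} - \gamma \mathbf{u}_k) = \mathbf{x}_{j_1}'(R\mathbf{y} - \gamma \mathbf{u}_k) = \mathbf{x}_{j_2}'(R\mathbf{y} - \gamma \mathbf{u}_k).$

Setting $\boldsymbol{\delta}_1 = \mathbf{x}_l - \mathbf{x}_{j_1}$, the first equality may be written $\boldsymbol{\delta}_1' R\mathbf{y} = \gamma \boldsymbol{\delta}_1' \mathbf{u}_k$ and
so when $\boldsymbol{\delta}_1' \mathbf{u}_k \neq 0$ determines

    $\gamma = \boldsymbol{\delta}_1' R\mathbf{y} / \boldsymbol{\delta}_1' \mathbf{u}_k =: \boldsymbol{\eta}_1' \mathbf{y}.$

[If $\boldsymbol{\delta}_1' \mathbf{u}_k = 0$, there are no qualifying $\mathbf{y}$, and $N(j_1, j_2)$ is empty.] Now using the
second equality and setting $\boldsymbol{\delta}_2 = \mathbf{x}_l - \mathbf{x}_{j_2}$, we see that $N(j_1, j_2)$ is contained
in the set of $\mathbf{y}$ for which

    $\boldsymbol{\delta}_2' R\mathbf{y} = \boldsymbol{\eta}_1' \mathbf{y}\, \boldsymbol{\delta}_2' \mathbf{u}_k.$

In other words, setting $\boldsymbol{\eta}_2 = R'\boldsymbol{\delta}_2 - (\boldsymbol{\delta}_2' \mathbf{u}_k)\boldsymbol{\eta}_1$, we have

    $N(j_1, j_2) \subset \{\mathbf{y} : \boldsymbol{\eta}_2' \mathbf{y} = 0\}.$

If we define

    $N(\mathbf{y}_0) = \bigcup \{N(j_1, j_2) : j_1, j_2 \notin \mathcal{A},\ j_1 \neq j_2\},$

it is evident that $N(\mathbf{y}_0)$ is a finite union of hyperplanes and hence closed.
For $\mathbf{y} \in B(\mathbf{y}_0) \setminus N(\mathbf{y}_0)$, a unique new variable joins the active set at step $k$.
Near each such $\mathbf{y}$ the "joining" variable is locally the same and $\hat{\gamma}_k(\mathbf{y})\mathbf{u}_k$ is
locally linear.

We then define $G_k \subset G_{k-1}$ as the union of such sets $B(\mathbf{y}) \setminus N(\mathbf{y})$ over
$\mathbf{y} \in G_{k-1}$. Thus $G_k$ is open and, on $G_k$, $\mathcal{A}_{k+1}(\mathbf{y})$ is locally constant and
$\hat{\boldsymbol{\mu}}_k(\mathbf{y})$ is locally linear. Thus properties (a)-(c) hold for $G_k$.

The same argument works for the initial case $k = 0$: since $\hat{\boldsymbol{\mu}}_0 = \mathbf{0}$, there is
no circularity.

Finally, since the intersection of $G_k$ with any compact set is covered by a
finite number of $B(\mathbf{y}_i) \setminus N(\mathbf{y}_i)$, it is clear that $G_k$ has full measure. □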

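Lemma 14. Suppose that, for $\mathbf{y}$ near $\mathbf{y}_0$, $\hat{\boldsymbol{\mu}}_{k-1}(\mathbf{y})$ is continuous (resp.
linear) and that $\mathcal{A}_k(\mathbf{y}) = \mathcal{A}_k$. Suppose also that, at $\mathbf{y}_0$, $\mathcal{A}_{k+1}(\mathbf{y}_0) = \mathcal{A}_k \cup \{k + 1\}$.

Then for $\mathbf{y}$ near $\mathbf{y}_0$, $\mathcal{A}_{k+1}(\mathbf{y}) = \mathcal{A}_k \cup \{k + 1\}$ and $\hat{\gamma}_k(\mathbf{y})$ and hence $\hat{\boldsymbol{\mu}}_k(\mathbf{y})$
are continuous (resp. linear) and uniformly Lipschitz.

Proof. Consider first the situation at $\mathbf{y}_0$, with $\hat{C}_k$ and $\hat{c}_{kj}$ defined
in (2.18) and (2.17), respectively. Since $k + 1 \notin \mathcal{A}_k$, we have $|\hat{C}_k(\mathbf{y}_0)| >
\hat{c}_{k,k+1}(\mathbf{y}_0)$, and $\hat{\gamma}_k(\mathbf{y}_0) > 0$ satisfies

(A.3)    $\hat{C}_k(\mathbf{y}_0) - \hat{\gamma}_k(\mathbf{y}_0) A_k \ \begin{cases} = \\ > \end{cases}\ \hat{c}_{k,j}(\mathbf{y}_0) - \hat{\gamma}_k(\mathbf{y}_0) a_{k,j}$  as  $\begin{cases} j = k + 1, \\ j > k + 1. \end{cases}$

In particular, it must be that $A_k \neq a_{k,k+1}$, and hence

    $\hat{\gamma}_k(\mathbf{y}_0) = \dfrac{\hat{C}_k(\mathbf{y}_0) - \hat{c}_{k,k+1}(\mathbf{y}_0)}{A_k - a_{k,k+1}} > 0.$

Call an index $j$ admissible if $j \notin \mathcal{A}_k$ and $a_{k,j} \neq A_k$. For $\mathbf{y}$ near $\mathbf{y}_0$, this
property is independent of $\mathbf{y}$. For admissible $j$, define

    $R_{k,j}(\mathbf{y}) = \dfrac{\hat{C}_k(\mathbf{y}) - \hat{c}_{k,j}(\mathbf{y})}{A_k - a_{k,j}},$

which is continuous (resp. linear) near $\mathbf{y}_0$ from the assumption on $\hat{\boldsymbol{\mu}}_{k-1}$. By
definition,

    $\hat{\gamma}_k(\mathbf{y}) = \min_{j \in \mathcal{P}_k(\mathbf{y})} R_{k,j}(\mathbf{y}),$

where

    $\mathcal{P}_k(\mathbf{y}) = \{j \text{ admissible and } R_{k,j}(\mathbf{y}) > 0\}.$

For admissible $j$, $R_{k,j}(\mathbf{y}_0) \neq 0$, and near $\mathbf{y}_0$ the functions $\mathbf{y} \to R_{k,j}(\mathbf{y})$ are
continuous and of fixed sign. Thus, near $\mathbf{y}_0$ the set $\mathcal{P}_k(\mathbf{y})$ stays fixed at
$\mathcal{P}_k(\mathbf{y}_0)$ and (A.3) implies that

    $R_{k,k+1}(\mathbf{y}) < R_{k,j}(\mathbf{y}), \qquad j > k + 1,\ j \in \mathcal{P}_k(\mathbf{y}).$

Consequently, for $\mathbf{y}$ near $\mathbf{y}_0$, only variable $k + 1$ joins the active set, and so
$\mathcal{A}_{k+1}(\mathbf{y}) = \mathcal{A}_k \cup \{k + 1\}$, and

(A.4)    $\hat{\gamma}_k(\mathbf{y}) = R_{k,k+1}(\mathbf{y}) = \dfrac{(\mathbf{x}_l - \mathbf{x}_{k+1})'(\mathbf{y} - \hat{\boldsymbol{\mu}}_{k-1}(\mathbf{y}))}{(\mathbf{x}_l - \mathbf{x}_{k+1})' \mathbf{u}_k}.$

This representation shows that both $\hat{\gamma}_k(\mathbf{y})$ and hence $\hat{\boldsymbol{\mu}}_k(\mathbf{y}) = \hat{\boldsymbol{\mu}}_{k-1}(\mathbf{y}) +
\hat{\gamma}_k(\mathbf{y})\mathbf{u}_k$ are continuous (resp. linear) near $\mathbf{y}_0$.

To show that $\hat{\gamma}_k$ is locally Lipschitz at $\mathbf{y}$, we set $\boldsymbol{\delta} = \mathbf{x}_l - \mathbf{x}_{k+1}$ and write,
using notation from (A.1),

    $\Delta \hat{\gamma}_k = \dfrac{\boldsymbol{\delta}'(\Delta \mathbf{y} - \Delta \hat{\boldsymbol{\mu}}_{k-1})}{\boldsymbol{\delta}' \mathbf{u}_k}.$

As $\mathbf{y}$ varies, there is a finite list of vectors $(\mathbf{x}_l, \mathbf{x}_{k+1}, \mathbf{u}_k)$ that can occur in the
denominator term $\boldsymbol{\delta}' \mathbf{u}_k$, and since all such terms are positive [as observed
below (A.3)], they have a uniform positive lower bound, $a_{\min}$ say. Since
$\|\boldsymbol{\delta}\| \leq 2$ and $\hat{\boldsymbol{\mu}}_{k-1}$ is Lipschitz ($L_{k-1}$) by assumption, we conclude that

    $\dfrac{|\Delta \hat{\gamma}_k|}{\|\Delta \mathbf{y}\|} \leq 2 a_{\min}^{-1} (1 + L_{k-1}) =: L_k.$ □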

A.2. Consequences of the positive cone condition. 

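Lemma 15. Suppose that $|\mathcal{A}_+| = |\mathcal{A}| + 1$ and that $X_{\mathcal{A}_+} = [X_{\mathcal{A}}\ \mathbf{x}_+]$
(where $\mathbf{x}_+ = s_j \mathbf{x}_j$ for some $j \notin \mathcal{A}$). Let $P_{\mathcal{A}} = X_{\mathcal{A}} G_{\mathcal{A}}^{-1} X_{\mathcal{A}}'$ denote projection
on $\mathrm{span}(X_{\mathcal{A}})$, so that $a = \mathbf{x}_+' P_{\mathcal{A}} \mathbf{x}_+ < 1$. The $+$-component of $G_{\mathcal{A}_+}^{-1} \mathbf{1}_{\mathcal{A}_+}$ is

(A.5)    $(G_{\mathcal{A}_+}^{-1} \mathbf{1}_{\mathcal{A}_+})_+ = (1 - a)^{-1} \Big( 1 - \dfrac{\mathbf{x}_+' \mathbf{u}_{\mathcal{A}}}{A_{\mathcal{A}}} \Big).$

Consequently, under the positive cone condition (4.11),

(A.6)    $\mathbf{x}_+' \mathbf{u}_{\mathcal{A}} < A_{\mathcal{A}}.$

Proof. Write $G_{\mathcal{A}_+}$ as a partitioned matrix

    $G_{\mathcal{A}_+} = \begin{pmatrix} X'X & X'\mathbf{x}_+ \\ \mathbf{x}_+'X & \mathbf{x}_+'\mathbf{x}_+ \end{pmatrix} = \begin{pmatrix} A & B \\ B' & D \end{pmatrix}.$

Applying the formula for the inverse of a partitioned matrix [e.g., Rao (1973),
page 33],

    $(G_{\mathcal{A}_+}^{-1} \mathbf{1}_{\mathcal{A}_+})_+ = -E^{-1} F' \mathbf{1} + E^{-1},$

where

    $E = D - B'A^{-1}B = 1 - \mathbf{x}_+' P_{\mathcal{A}} \mathbf{x}_+,$
    $F = A^{-1}B = G_{\mathcal{A}}^{-1} X' \mathbf{x}_+,$

from which (A.5) follows. The positive cone condition implies that $G_{\mathcal{A}_+}^{-1} \mathbf{1}_{\mathcal{A}_+} >
0$, and so (A.6) is immediate. □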

A.3. Global continuity and Lemma 3. We shall call $\mathbf{y}_0$ a multiple point
at step $k$ if two or more variables enter at the same time. Lemma 14 shows
that such points form a set of measure zero, but they can and do cause
discontinuities in $\hat{\boldsymbol{\mu}}_{k+1}$ at $\mathbf{y}_0$ in general. We will see, however, that the
positive cone condition prevents such discontinuities.

We confine our discussion to double points, hoping that these arguments
will be sufficient to establish the same pattern of behavior at points of mul-
tiplicity 3 or higher. In addition, by renumbering, we shall suppose that
indices $k + 1$ and $k + 2$ are those that are added at double point $\mathbf{y}_0$. Simi-
larly, for convenience only, we assume that $\mathcal{A}_k(\mathbf{y})$ is constant near $\mathbf{y}_0$. Our
task then is to show that, for $\mathbf{y}$ near a double point $\mathbf{y}_0$, both $\hat{\boldsymbol{\mu}}_k(\mathbf{y})$ and
$\hat{\boldsymbol{\mu}}_{k+1}(\mathbf{y})$ are continuous and uniformly locally Lipschitz.

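Lemma 16. Suppose that $\mathcal{A}_k(\mathbf{y}) = \mathcal{A}_k$ is constant near $\mathbf{y}_0$ and that
$\mathcal{A}_{k+}(\mathbf{y}_0) = \mathcal{A}_k \cup \{k + 1, k + 2\}$. Then for $\mathbf{y}$ near $\mathbf{y}_0$, $\mathcal{A}_{k+}(\mathbf{y}) \setminus \mathcal{A}_k$ can only
be one of three possibilities, namely $\{k + 1\}$, $\{k + 2\}$ or $\{k + 1, k + 2\}$. In all
cases $\hat{\boldsymbol{\mu}}_k(\mathbf{y}) = \hat{\boldsymbol{\mu}}_{k-1}(\mathbf{y}) + \hat{\gamma}_k(\mathbf{y})\mathbf{u}_k$ as usual, and both $\hat{\gamma}_k(\mathbf{y})$ and $\hat{\boldsymbol{\mu}}_k(\mathbf{y})$ are
continuous and locally Lipschitz.

Proof. We use notation and tools from the proof of Lemma 14. Since
$\mathbf{y}_0$ is a double point and the positivity set $\mathcal{P}_k(\mathbf{y}) = \mathcal{P}_k$ near $\mathbf{y}_0$, we have

    $0 < R_{k,k+1}(\mathbf{y}_0) = R_{k,k+2}(\mathbf{y}_0) < R_{k,j}(\mathbf{y}_0)$  for $j \in \mathcal{P}_k \setminus \{k + 1, k + 2\}$.

Continuity of $R_{k,j}$ implies that near $\mathbf{y}_0$ we still have

    $0 < R_{k,k+1}(\mathbf{y}), R_{k,k+2}(\mathbf{y}) < \min\{R_{k,j}(\mathbf{y});\ j \in \mathcal{P}_k \setminus \{k + 1, k + 2\}\}.$

Hence $\mathcal{A}_{k+} \setminus \mathcal{A}_k$ must equal $\{k + 1\}$ or $\{k + 2\}$ or $\{k + 1, k + 2\}$ according
as $R_{k,k+1}(\mathbf{y})$ is less than, greater than or equal to $R_{k,k+2}(\mathbf{y})$. The continuity
of

    $\hat{\gamma}_k(\mathbf{y}) = \min\{R_{k,k+1}(\mathbf{y}), R_{k,k+2}(\mathbf{y})\}$

is immediate, and the local Lipschitz property follows from the arguments
of Lemma 14. □

Lemma 17. Assume the conditions of Lemma 16 and in addition that
the positive cone condition (4.11) holds. Then $\hat{\boldsymbol{\mu}}_{k+1}(\mathbf{y})$ is continuous and
locally Lipschitz near $\mathbf{y}_0$.

Proof. Since $\mathbf{y}_0$ is a double point, property (A.3) holds, but now with
equality when $j = k + 1$ or $k + 2$ and strict inequality otherwise. In other
words, there exists $\delta_0 > 0$ for which

    $\hat{C}_{k+1}(\mathbf{y}_0) - \hat{c}_{k+1,j}(\mathbf{y}_0) \ \begin{cases} = 0, & \text{if } j = k + 2, \\ \geq \delta_0, & \text{if } j > k + 2. \end{cases}$

Consider a neighborhood $B(\mathbf{y}_0)$ of $\mathbf{y}_0$ and let $N(\mathbf{y}_0)$ be the set of double
points in $B(\mathbf{y}_0)$, that is, those for which $\mathcal{A}_{k+1}(\mathbf{y}) \setminus \mathcal{A}_k = \{k + 1, k + 2\}$.
We establish the convention that at such double points $\hat{\boldsymbol{\mu}}_{k+1}(\mathbf{y}) = \hat{\boldsymbol{\mu}}_k(\mathbf{y})$;
at other points $\mathbf{y}$ in $B(\mathbf{y}_0)$, $\hat{\boldsymbol{\mu}}_{k+1}(\mathbf{y})$ is defined by $\hat{\boldsymbol{\mu}}_k(\mathbf{y}) + \hat{\gamma}_{k+1}(\mathbf{y})\mathbf{u}_{k+1}$ as
usual.

Now consider those $\mathbf{y}$ near $\mathbf{y}_0$ for which $\mathcal{A}_{k+1}(\mathbf{y}) \setminus \mathcal{A}_k = \{k + 1\}$, and so,
from the previous lemma, $\mathcal{A}_{k+2}(\mathbf{y}) \setminus \mathcal{A}_{k+1} = \{k + 2\}$. For such $\mathbf{y}$, continuity
and the local Lipschitz property for $\hat{\boldsymbol{\mu}}_k$ imply that

    $\hat{C}_{k+1}(\mathbf{y}) - \hat{c}_{k+1,j}(\mathbf{y}) = O(\|\mathbf{y} - \mathbf{y}_0\|)$,  if $j = k + 2$.

It is at this point that we use the positive cone condition (via Lemma 15)
to guarantee that $A_{k+1} > a_{k+1,k+2}$. Also, since $\mathcal{A}_{k+1}(\mathbf{y}) \setminus \mathcal{A}_k = \{k + 1\}$, we
have

    $\hat{C}_{k+1}(\mathbf{y}) > \hat{c}_{k+1,k+2}(\mathbf{y}).$

These two facts together show that $k + 2 \in \mathcal{P}_{k+1}(\mathbf{y})$ and hence that

    $\hat{\gamma}_{k+1}(\mathbf{y}) = \dfrac{\hat{C}_{k+1}(\mathbf{y}) - \hat{c}_{k+1,k+2}(\mathbf{y})}{A_{k+1} - a_{k+1,k+2}} = O(\|\mathbf{y} - \mathbf{y}_0\|)$

is continuous and locally Lipschitz. In particular, as $\mathbf{y}$ approaches $N(\mathbf{y}_0)$,
we have $\hat{\gamma}_{k+1}(\mathbf{y}) \to 0$. □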

Remark A.1. We say that a function $g : \mathbb{R}^n \to \mathbb{R}$ is almost differentiable
if it is absolutely continuous on almost all line segments parallel to the co-
ordinate axes, and its partial derivatives (which consequently exist a.e.) are
locally integrable. This definition of almost differentiability appears super-
ficially to be weaker than that given by Stein, but it is in fact precisely the
property used in his proof. Furthermore, this definition is equivalent to the
standard definition of weak differentiability used in analysis.

Proof of Lemma 3. We have shown explicitly that $\hat{\boldsymbol{\mu}}_k(\mathbf{y})$ is continuous
and uniformly locally Lipschitz near single and double points. Similar argu-
ments extend the property to points of multiplicity 3 and higher, and so all
points $\mathbf{y}$ are covered. Finally, absolute continuity of $\mathbf{y} \to \hat{\boldsymbol{\mu}}_k(\mathbf{y})$ on line seg-
ments is a simple consequence of the uniform Lipschitz property, and so $\hat{\boldsymbol{\mu}}_k$ is
almost differentiable. □